I am running Nutch integrated with Solr as a search engine; the Nutch crawl job runs on Hadoop. My next requirement is to run a content-categorisation job over the crawled content. How can I access the text content that is stored in HDFS for this tagging job? I plan to write the tagging job in Java — how can I read this content from Java?
Viewed 278 times
    2 Answers
The crawled content is stored in the data file under each segment's content directory, for example:
segments/2014.../content/part-00000/data
This file is a Hadoop SequenceFile. To read it you can use the SequenceFile code from the Hadoop book or from this answer.
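As a rough sketch of what that SequenceFile-reading code looks like with the classic Hadoop API: in a Nutch 1.x content segment the keys are page URLs (Text) and the values are org.apache.nutch.protocol.Content records holding the raw fetched bytes. (Note that if you want the plain extracted text rather than raw bytes, the parse_text directory of the same segment stores ParseText values instead.) This assumes Hadoop and Nutch jars on the classpath and is not the only way to do it:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {
    public static void main(String[] args) throws IOException {
        // Path to a segment's content data file on HDFS,
        // e.g. segments/2014.../content/part-00000/data
        Path path = new Path(args[0]);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Keys are page URLs (Text); values are Nutch Content records
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text url = new Text();
            Content content = new Content();
            while (reader.next(url, content)) {
                // content.getContent() returns the raw fetched bytes;
                // content.getContentType() tells you how to decode them
                System.out.println(url + " -> " + content.getContentType()
                        + " (" + content.getContent().length + " bytes)");
            }
        } finally {
            reader.close();
        }
    }
}
```

Run it with the Hadoop launcher (e.g. hadoop jar yourjob.jar SegmentContentReader segments/…/content/part-00000/data) so the HDFS configuration is picked up automatically.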
Why not do the categorisation inside the Nutch pipeline instead? Write your own Nutch plugin that categorises each page before it is sent to Solr, and store the category value as a field in Solr.
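A minimal sketch of such a plugin, using the Nutch 1.x IndexingFilter extension point: the filter sees each parsed page on its way to the indexer and can add fields to the outgoing document. The categorize method here is a hypothetical placeholder for your own classifier, and the "category" field name is an assumption — whatever you choose must also be declared in your Solr schema, and the plugin must be registered in its plugin.xml and enabled via plugin.includes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class CategoryIndexingFilter implements IndexingFilter {
    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                CrawlDatum datum, Inlinks inlinks)
            throws IndexingException {
        // Classify the extracted page text and attach the result
        // as an extra field on the document sent to Solr
        String category = categorize(parse.getText());
        doc.add("category", category);
        return doc;
    }

    // Hypothetical stand-in for a real categorisation model
    private String categorize(String text) {
        return text != null && text.toLowerCase().contains("sport")
                ? "sports" : "general";
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```

This keeps the tagging step out of a separate MapReduce job entirely: categories are computed once, at index time, and are immediately queryable in Solr.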
        Mohsen ZareZardeyni
        
- 936
 - 7
 - 17