I am running Nutch integrated with Solr as a search engine; the Nutch crawl job runs on Hadoop. My next requirement is to run a content-categorisation job over the crawled content. How can I access the text content that is stored in HDFS for this tagging job? I plan to write the tagging job in Java — how can I read this content from Java?
Viewed 278 times
    2 Answers
The crawled content is stored in the data file under each segment's content directory, for example:
segments/2014.../content/part-00000/data
This file is a Hadoop SequenceFile. To read it you can use the SequenceFile code from the Hadoop book or from this answer.
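As a rough sketch of what that SequenceFile-reading code looks like with the classic Hadoop API: in a Nutch 1.x content segment the keys are page URLs (Text) and the values are org.apache.nutch.protocol.Content records holding the raw fetched bytes. (Note that if you want the plain extracted text rather than raw bytes, the parse_text directory of the same segment stores ParseText values instead.) This assumes Hadoop and Nutch jars on the classpath and is not the only way to do it:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {
    public static void main(String[] args) throws IOException {
        // Path to a segment's content data file on HDFS,
        // e.g. segments/2014.../content/part-00000/data
        Path path = new Path(args[0]);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Keys are page URLs (Text); values are Nutch Content records
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text url = new Text();
            Content content = new Content();
            while (reader.next(url, content)) {
                // content.getContent() returns the raw fetched bytes;
                // content.getContentType() tells you how to decode them
                System.out.println(url + " -> " + content.getContentType()
                        + " (" + content.getContent().length + " bytes)");
            }
        } finally {
            reader.close();
        }
    }
}
```

Run it with the Hadoop launcher (e.g. hadoop jar yourjob.jar SegmentContentReader segments/…/content/part-00000/data) so the HDFS configuration is picked up automatically.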
Why not do the categorisation inside the Nutch pipeline instead? Write your own Nutch plugin that categorises each page before it is sent to Solr, and store the category value as a field in Solr.
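A minimal sketch of such a plugin, using the Nutch 1.x IndexingFilter extension point: the filter sees each parsed page on its way to the indexer and can add fields to the outgoing document. The categorize method here is a hypothetical placeholder for your own classifier, and the "category" field name is an assumption — whatever you choose must also be declared in your Solr schema, and the plugin must be registered in its plugin.xml and enabled via plugin.includes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class CategoryIndexingFilter implements IndexingFilter {
    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                CrawlDatum datum, Inlinks inlinks)
            throws IndexingException {
        // Classify the extracted page text and attach the result
        // as an extra field on the document sent to Solr
        String category = categorize(parse.getText());
        doc.add("category", category);
        return doc;
    }

    // Hypothetical stand-in for a real categorisation model
    private String categorize(String text) {
        return text != null && text.toLowerCase().contains("sport")
                ? "sports" : "general";
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```

This keeps the tagging step out of a separate MapReduce job entirely: categories are computed once, at index time, and are immediately queryable in Solr.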
        Mohsen ZareZardeyni
        
- 936
 - 7
 - 17