I am trying to run boilerpipe with Python multiprocessing to parse RSS feeds from multiple sources. The problem is that it hangs in one of the worker processes after processing some links. The whole flow works if I remove the pool and run it in a plain loop.
Here is my multiprocessing code:
from multiprocessing import Pool

proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link,), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
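To narrow down which link blocks, I could keep the AsyncResult handles and fetch each result with a timeout instead of relying only on join(); a worker that never returns would then surface as a multiprocessing.TimeoutError naming the offending link rather than hanging forever. This is only a diagnostic sketch (the 120-second timeout is an arbitrary guess, and process_link_for_feeds / store_results_to_db are the same functions as above):

from multiprocessing import Pool, TimeoutError

proc_pool = Pool(processes=4)
# Keep the AsyncResult handles so each result can be fetched with a timeout
pending = [(each_link, proc_pool.apply_async(process_link_for_feeds, args=(each_link,)))
           for each_link in data]
proc_pool.close()
for each_link, async_result in pending:
    try:
        # Raises multiprocessing.TimeoutError if this worker has not returned in 120 seconds
        result = async_result.get(timeout=120)
        store_results_to_db(result)
    except TimeoutError:
        print "Timed out -->", each_link
# terminate() instead of join() so the script exits even if a worker is stuck
proc_pool.terminate()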
This is my boilerpipe code which is being called inside process_link_for_feeds():
from boilerpipe.extract import Extractor

def parse_using_bp(in_url):
    extracted_html = ""
    # Skip URLs that match the skip pattern
    if ContentParser.url_skip_p.match(in_url):
        return extracted_html
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        extracted_html = extractor.getHTML()
        del extractor
    except BaseException as e:
        print "Something's wrong at Boilerpipe -->", in_url, "-->", e
        extracted_html = ""
    finally:
        return extracted_html
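For comparison, this is essentially the sequential version that runs fine without the pool (same functions, just called in a loop):

# Sequential baseline: same flow without the pool, completes without hanging
for each_link in data:
    result = process_link_for_feeds(each_link)
    store_results_to_db(result)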
I am clueless as to why it is hanging. Is there something wrong with the proc_pool code?