I'm a writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine . I'am using Django, Scrapy, and Celery for achieving this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task
@shared_task
def crawl(domain):
return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider
class WebSpider(CrawlSpider):
name = "web"
def __init__(self, **kw):
super(WebSpider, self).__init__(**kw)
url = kw.get('domain') or kw.get('url')
if not (url.startswith('http://') or url.startswith('https://')):
url = "http://%s/" % url
self.url = url
self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
self.start_urls = [url]
self.rules = [
Rule(SgmlLinkExtractor(
allow_domains=self.allowed_domains,
unique=True), callback='parse_page', follow=True)
]
def parse_start_url(self, response):
return self.parse_page(response)
def parse_page(self, response):
sel = Selector(response)
item = HTMLPageItem()
item['url'] = response.request.url
item['title'] = sel.xpath('//title/text()').extract()[0]
item.save()
return item
The problem is the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.