I have found lots of Scrapy tutorials (such as this good tutorial) that all require the steps listed below. The result is always a full project with lots of files (a scrapy.cfg, some .py files, and a specific folder structure).
How can I make these steps work as a single, self-contained Python file that can be run with python mycrawler.py, instead of a full project with lots of files, .cfg files, etc., and having to run scrapy crawl myproject -o myproject.json? (See the sketch after the steps for the kind of file I have in mind. Also, it seems that scrapy is a new shell command installed along with the library; is this true?)
Note: here might be an answer to this question, but unfortunately it is deprecated and no longer works.
1) Create a new scrapy project with scrapy startproject myproject
2) Define the data structure with Item like this:
from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...
3) Define the spider with:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...
4) Run with:
scrapy crawl myproject -o myproject.json
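For reference, here is the kind of single mycrawler.py I am imagining, pieced together from the Scrapy documentation. It is untested; it uses what seem to be the newer scrapy.Spider, response.xpath and CrawlerProcess APIs instead of the deprecated BaseSpider/HtmlXPathSelector ones above, and the XPath expressions and feed settings are only placeholders:

# mycrawler.py - rough, untested sketch; names, XPaths and settings are placeholders
import scrapy
from scrapy.crawler import CrawlerProcess

class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # response.xpath() seems to be the replacement for HtmlXPathSelector
        for href in response.xpath("//a/@href").extract():
            item = MyItem()
            item["title"] = response.xpath("//title/text()").extract_first()
            item["link"] = href
            yield item

if __name__ == "__main__":
    # CrawlerProcess looks like the documented way to start a crawl from a
    # plain script; the FEED_* settings are my attempt to mimic "-o myproject.json"
    process = CrawlerProcess(settings={
        "FEED_FORMAT": "json",
        "FEED_URI": "myproject.json",
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished

Is this roughly the right approach, or is there a more standard way to get a single runnable file?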