How do you use proxy support with the Python web-scraping framework Scrapy?
9 Answers
Single Proxy
Enable HttpProxyMiddleware in your settings.py, like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
}

Then pass the proxy to the request via request.meta:

request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"
yield request
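Putting both pieces together, here is a minimal self-contained sketch. The spider name, start URL, and proxy address are placeholders, and the middleware is enabled via custom_settings instead of the project settings.py:

import scrapy

class SingleProxySpider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = "single_proxy_spider"

    # Equivalent to the DOWNLOADER_MIDDLEWARES entry in settings.py
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="http://example.com",
            meta={'proxy': "http://host:port"},  # placeholder proxy, scheme included
        )

    def parse(self, response):
        self.logger.info("Fetched %s via the proxy", response.url)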
You can also choose a proxy address at random if you have an address pool, like this:
Multiple Proxies
import random

from scrapy.http import Request
from scrapy.spider import BaseSpider  # old name; modern Scrapy uses scrapy.Spider

class MySpider(BaseSpider):
    name = "my_spider"
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']
    def parse(self, response):
        ...parse code...
        if something:
            yield self.get_request(url)
    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            # Pick a proxy at random for each outgoing request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
- The documentation says that the `HttpProxyMiddleware` sets the proxy inside every Request's meta attr, so enabling the middleware AND setting it manually would make no sense – Rafael T Dec 22 '14 at 20:16
- I should have copied this code. I glanced at it and then coded it myself, but the proxy functionality was not working. Now I see the proxy value was set on `request.headers` instead of `request.meta`. Stupid me (face palm)! I went to look at the `HttpProxyMiddleware` code: it skips a request if someone has already set `request.meta['proxy']`, so there is no need to list it in the settings (see the simplified sketch after these comments): https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpproxy.py – Thamme Gowda Jul 21 '17 at 03:48
- I am not sure I understand the difference between the two: is `BaseSpider` your original spider and `MySpider` the actual modified spider, or does `BaseSpider` refer to `scrapy.Spider`? – ishandutta2007 Dec 19 '19 at 10:46
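For reference, here is a simplified sketch (not the verbatim Scrapy source) of the check Thamme Gowda mentions in HttpProxyMiddleware.process_request:

class HttpProxyMiddleware(object):
    def process_request(self, request, spider):
        # If the spider already set request.meta['proxy'], leave it alone
        if 'proxy' in request.meta:
            return
        # ...otherwise fall back to the proxies read from the
        # http_proxy/https_proxy environment variables at startup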
 
From the Scrapy FAQ,
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy.  How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to use an HTTPS proxy and visit HTTPS sites, set the environment variable https_proxy instead:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
- Thanks ... So I need to set this var before running the scrapy crawler; it's not possible to set it or change it from the crawler code? – no1 Jan 17 '11 at 11:59
- You can even set the proxy on a per-request basis with: request.meta['proxy'] = 'http://your.proxy.address' – Pablo Hoffman Jan 25 '11 at 19:35
- @ocean800 I use scrapy to scrape a website that shows your current IP to see if it's using the proxy. That way I can load the page via Chrome and see my actual IP and compare it to what scrapy sees on the same page. – Shannon Cole Jun 24 '18 at 12:53
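Shannon Cole's check can be scripted. A hypothetical sketch, assuming an IP-echo endpoint such as httpbin.org/ip and a placeholder proxy:

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    # Hypothetical spider for verifying that traffic goes through the proxy
    name = "proxy_check"

    def start_requests(self):
        yield scrapy.Request(
            url="https://httpbin.org/ip",        # echoes the caller's IP as JSON
            meta={'proxy': "http://host:port"},  # placeholder proxy
        )

    def parse(self, response):
        # If the proxy works, this logs the proxy's exit IP, not your own
        self.logger.info("IP seen by the server: %s", response.text)

Run it once with the proxy line commented out and once with it enabled; the two logged IPs should differ.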
 
1 – Create a new file called “middlewares.py”, save it in your scrapy project, and add the following code to it.
import base64

class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; use b64encode rather
        # than the deprecated encodestring, which appends a newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2 – Open your project’s configuration file (./project_name/settings.py) and add the following code
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Now your requests should be passed through this proxy. Simple, isn't it?
- I implemented your solution, which looks correct, but I keep getting a Twisted error: twisted.web._newclient.ResponseNeverReceived. ANY ADVICE??? – ccdpowell May 07 '15 at 01:09
- Take care to use `base64.b64encode` instead of `base64.encodestring`, as the latter adds a newline character to the encoded base64 result...! See http://stackoverflow.com/a/32243566/426790 – Greg Sadetsky Feb 28 '16 at 03:03
 
As I had trouble setting the environment variables in /etc/environment, here is what I put in my spider (Python):
os.environ["http_proxy"] = "http://localhost:12345"
- Might as well add `os.environ["https_proxy"]` to it. Worked for me having both. – James Koss May 10 '19 at 05:08
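Combining this answer with James Koss's comment, the top of the spider module would set both variables (localhost:12345 is the placeholder proxy from the answer):

import os

# Route both plain-HTTP and HTTPS requests through the proxy
os.environ["http_proxy"] = "http://localhost:12345"
os.environ["https_proxy"] = "http://localhost:12345"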
 
There is a nice middleware written by someone else [1]: https://github.com/aivarsk/scrapy-proxies "Scrapy proxy middleware"
Here is what I do
Method 1:
Create a downloader middleware like this:
class ProxiesDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Route every request through the proxy (include the scheme)
        request.meta['proxy'] = 'http://user:pass@host:port'
and enable that in settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'my_scrapy_project_directory.middlewares.ProxiesDownloaderMiddleware': 600,
}
That is it: the proxy will now be applied to every request.
Method 2:
Just enable HttpProxyMiddleware in settings.py and then do this for each request
yield Request(url=..., meta={'proxy': 'http://user:pass@host:port'})
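For context, a minimal sketch of Method 2 inside a spider. The spider name, URL, and proxy are placeholders; recent Scrapy versions extract the user:pass credentials from meta['proxy'], while older versions may need a manual Proxy-Authorization header as in the middleware answer above:

import scrapy

class Method2Spider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = "method2_example"

    def start_requests(self):
        # HttpProxyMiddleware picks up meta['proxy'] for this request
        yield scrapy.Request(
            url="http://example.com",
            meta={'proxy': 'http://user:pass@host:port'},
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)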
- @rom it's the name of the folder your scrapy project is in, it's that simple – Umair Ayub Dec 14 '21 at 05:09
 
In Windows I put together a couple of previous answers and it worked. I simply did:
C:\> set http_proxy=http://username:password@proxy:port
and then I launched my program:
C:/.../RightFolder> scrapy crawl dmoz
where "dmoz" is the spider name (I'm writing it because it's the one you find in the Scrapy tutorial on the internet, and if you're here you have probably started from it).
 
I would recommend using a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up a proxy on every run.
This is directly from the GitHub README.
Install the scrapy-proxies library:

pip install scrapy_proxies

In your settings.py, add the following settings:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
Here you can change the retry times and set a single or rotating proxy. Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port
After this, all your requests for that project will be sent through the proxy. The proxy is rotated randomly for every request; it will not affect concurrency.
Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
#    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
Happy crawling!!!