I run a site which provides Subversion & TRAC hosting for lots of open source projects. This works pretty well, with one exception: many search engines do not care about robots.txt and DoS the web server with lots of parallel requests to TRAC, for example by downloading changesets as tar/zip archives.
As I run a lot of TRAC repositories under one domain, I use wildcards in robots.txt, which according to Google should be permitted:
User-agent: *
Disallow: /*/changeset
Disallow: /*/browser
Disallow: /*/log
Unfortunately even Google does not honor it, although Webmaster Tools confirms that those specific URIs would be blocked. And yes, I reported it to them, but they didn't really care. Others like Yandex certainly don't care about it either.
So plan B is to lock the bots out in the Apache configuration. A friend of mine gave me some hints on how to do it:
<Directory /foo/bar>
    # tag requests from the offending crawlers by User-Agent ...
    SetEnvIf User-Agent Yandex BlockYandex=1
    SetEnvIf User-Agent METASpider BlockMETASpider=1
    SetEnvIf User-Agent Mail.ru BlockMailru=1
    # ... and then deny exactly those requests
    Order allow,deny
    Allow from all
    Deny from env=BlockYandex
    Deny from env=BlockMETASpider
    Deny from env=BlockMailru
</Directory>
Now I am trying to figure out whether I can do something like that with wildcards as well, so I don't have to add a <Directory> section for every single repository. I found <LocationMatch> in the Apache docs, but I am not sure whether I can use it as a drop-in replacement for <Directory>; something like the sketch below is what I have in mind.
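This is only a rough, untested sketch: the path regex is a guess based on my URL layout (each TRAC instance sits directly below the document root), and I collapsed the three environment variables into a single one I made up called BlockedBot:

<LocationMatch "^/[^/]+/(changeset|browser|log)">
    # same idea as above: tag the bots by User-Agent, then deny them
    SetEnvIf User-Agent Yandex BlockedBot=1
    SetEnvIf User-Agent METASpider BlockedBot=1
    SetEnvIf User-Agent Mail\.ru BlockedBot=1
    Order allow,deny
    Allow from all
    Deny from env=BlockedBot
</LocationMatch>

If that works, a single block would cover all repositories at once instead of one <Directory> section per repository.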
So my question is: can I use <LocationMatch> for that, and/or does anyone have better ideas on how to filter the bots server-side?