I run a site which provides Subversion & TRAC hosting for lots of open source projects. This works pretty well, with one exception: many search engines do not care about robots.txt and DoS the web server with lots of parallel requests to TRAC, for example by downloading changesets as tar/zip archives.
As I run a lot of TRAC repositories under one domain, I use wildcards in robots.txt, which according to Google should be permitted:
User-agent: *
Disallow: /*/changeset
Disallow: /*/browser
Disallow: /*/log
Unfortunately even Google does not honor it, although Webmaster Tools confirms that those specific URIs would be blocked. And yes, I reported it to them, but they didn't really care. Others like Yandex certainly don't care about it either.
So plan B is to lock the bots out in the Apache configuration. A friend of mine gave me some hints on how to do it:
<Directory /foo/bar>
    # tag requests from the offending crawlers by User-Agent ...
    SetEnvIf User-Agent Yandex BlockYandex=1
    SetEnvIf User-Agent METASpider BlockMETASpider=1
    SetEnvIf User-Agent Mail.ru BlockMailru=1
    # ... and then deny exactly those requests
    Order allow,deny
    Allow from all
    Deny from env=BlockYandex
    Deny from env=BlockMETASpider
    Deny from env=BlockMailru
</Directory>
Now I am trying to figure out whether I can do something like that with wildcards as well, so I don't have to add a <Directory> section for every single repository. I found <LocationMatch> in the Apache docs, but I am not sure whether I can use it as a drop-in replacement for <Directory>; something like the sketch below is what I have in mind.
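This is only a rough, untested sketch: the path regex is a guess based on my URL layout (each TRAC instance sits directly below the document root), and I collapsed the three environment variables into a single one I made up called BlockedBot:

<LocationMatch "^/[^/]+/(changeset|browser|log)">
    # same idea as above: tag the bots by User-Agent, then deny them
    SetEnvIf User-Agent Yandex BlockedBot=1
    SetEnvIf User-Agent METASpider BlockedBot=1
    SetEnvIf User-Agent Mail\.ru BlockedBot=1
    Order allow,deny
    Allow from all
    Deny from env=BlockedBot
</LocationMatch>

If that works, a single block would cover all repositories at once instead of one <Directory> section per repository.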
So my question is: can I use <LocationMatch> for that, and/or does anyone have better ideas on how to filter the bots server-side?