How can we know which URLs robots.txt allows us to crawl if we don't know which folder a URL belongs to?

Question

I'm going to write a web crawler, but first I want to know what I will be allowed to crawl.

Tell me if I'm wrong, but in robots.txt, websites list folders, not URLs, that can and can't be crawled. So how can we know which folder a URL belongs to?

score 0 · Accepted Answer · answered Jan 21 '19 at 14:01

The robots.txt file excludes path prefixes, which usually look like directories. For example, if a robots.txt contains the rule Disallow: /foo, then /foo/bar.html must not be crawled.

For any URL you want to crawl, you have to check whether its path matches one of the directives in the robots file.
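As a minimal sketch of that check, Python's standard urllib.robotparser module can do the matching for you. The rules, the example.com URLs, the "MyCrawler" user agent, and the allowed helper below are made up for illustration; a real crawler would load the site's own robots.txt (for example with set_url and read) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
rules = """\
User-agent: *
Disallow: /foo
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(url, agent="MyCrawler"):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)

for url in (
    "https://example.com/foo/bar.html",    # path starts with /foo
    "https://example.com/foobar.html",     # also starts with /foo
    "https://example.com/other/page.html", # no matching rule
):
    print(url, "->", "allowed" if allowed(url) else "disallowed")
```

Note that the second URL is disallowed too: the rule is compared against the beginning of the path as text, not against a folder on the server.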

See the Google documentation for more info and examples:

The path value is used as a basis to determine whether or not a rule applies to a specific URL on a site. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path).

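Note that URLs do not have to indicate actual directories on a server. /download.php?what=thestuff could be functionally equivalent to /download/thestuff and point to the same resource. The rules are matched against the URL path as text, so you never need to know how the server organizes its files.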
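To make the point concrete, here is a rough sketch of plain prefix matching done by hand. It assumes the string being compared is the URL's path plus its query string, and it deliberately ignores wildcards, Allow rules, and per-user-agent groups, so it is not a full robots.txt implementation. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def match_target(url):
    """Return the part of the URL that rules are compared against:
    the path, plus the query string if there is one."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")

def is_disallowed(url, disallowed_prefixes):
    """Simplified check: True if the URL starts with any disallowed prefix."""
    target = match_target(url)
    return any(target.startswith(prefix) for prefix in disallowed_prefixes)

script_url = "https://example.com/download.php?what=thestuff"
pretty_url = "https://example.com/download/thestuff"

# A rule of /download covers both forms; /download/ covers only the second.
print(is_disallowed(script_url, ["/download"]))
print(is_disallowed(pretty_url, ["/download"]))
print(is_disallowed(script_url, ["/download/"]))
```

Even if both URLs serve the same resource, each one is judged only by its own text, which is why the folder layout on the server does not matter.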
