I need to download all the PDF files on a site. The trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of them. I tried WinHTTrack, but I couldn't get it to work. DownThemAll for Firefox does not crawl multiple pages or entire sites. I know there is a solution out there, since I can't possibly be the first person to run into this problem. What would you recommend?

4 Answers

From http://www.go2linux.org/tips-and-tricks-of-wget-to-download-files:

wget -r -A pdf http://www.site.com
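
(The -r flag makes wget follow links recursively; -A pdf tells it to keep only files ending in .pdf and discard everything else it fetches along the way. Adding --no-parent keeps the crawl from wandering above the starting directory.)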
miku

Google can restrict search results to files of a certain type. Combine that with the site: operator and you have your "crawler".

Example: http://www.google.com/search?q=site:soliddocuments.com+filetype:pdf

Michael

Use a web-crawling library, e.g. in Ruby: http://www.example-code.com/ruby/spider_begin.asp
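
The page above is a Ruby example; below is a rough sketch of the same idea in Python, assuming the third-party requests and beautifulsoup4 packages are installed. The start URL http://www.site.com is a placeholder; the crawl stays on that host, follows ordinary links, and collects anything ending in .pdf.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "http://www.site.com"   # placeholder: the site you want to crawl

def crawl_for_pdfs(start_url, max_pages=500):
    """Breadth-first crawl of one host; returns the PDF URLs it finds."""
    host = urlparse(start_url).netloc
    seen, queue, pdfs = {start_url}, deque([start_url]), []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                  # skip pages that fail to load
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue                                  # only parse HTML pages for links
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc != host or link in seen:
                continue                              # stay on-site, visit each URL once
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdfs.append(link)                     # collect it (or download it here)
            else:
                queue.append(link)                    # keep crawling HTML pages
    return pdfs

if __name__ == "__main__":
    for pdf_url in crawl_for_pdfs(START_URL):
        print(pdf_url)

Once you have the list, you can download inside the loop or feed the printed URLs to wget -i instead.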

Alistra

If there are no links to PDF files, a crawler won't help and you basically only have two choices:

  1. Get the list from somewhere else (ask the site's webmaster for one).
  2. Get the list from the site's directory listing (see the sketch below). If they have disabled directory indexing on their web server, though, you won't be able to use it.
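
If directory indexing does happen to be enabled, pulling the PDF links out of the index page is straightforward. A minimal sketch, assuming the same requests and beautifulsoup4 packages and using http://www.site.com/files/ as a placeholder for a directory that serves a plain Apache/nginx-style listing:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LISTING_URL = "http://www.site.com/files/"   # placeholder: a directory with indexing enabled

resp = requests.get(LISTING_URL, timeout=10)
resp.raise_for_status()                      # fails here if indexing is disabled (403/404)

for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
    if a["href"].lower().endswith(".pdf"):
        print(urljoin(LISTING_URL, a["href"]))   # or download it instead of printing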