I need to download all the PDF files on a site. The trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of them. I tried WinHTTrack, but I couldn't get it to work. DownThemAll for Firefox does not crawl multiple pages or entire sites. I know there is a solution out there, since I can't possibly be the first person to run into this problem. What would you recommend?

4 Answers

From http://www.go2linux.org/tips-and-tricks-of-wget-to-download-files:

wget -r -A pdf http://www.site.com
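
(The -r flag makes wget follow links recursively; -A pdf tells it to keep only files ending in .pdf and discard everything else it fetches along the way. Adding --no-parent keeps the crawl from wandering above the starting directory.)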
miku

Google can restrict search results to files of a certain type. Combine that with the site: operator and you have your "crawler".

Example: http://www.google.com/search?q=site:soliddocuments.com+filetype:pdf

Michael

Use a web-crawling library, e.g. in Ruby: http://www.example-code.com/ruby/spider_begin.asp
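
The page above is a Ruby example; below is a rough sketch of the same idea in Python, assuming the third-party requests and beautifulsoup4 packages are installed. The start URL http://www.site.com is a placeholder; the crawl stays on that host, follows ordinary links, and collects anything ending in .pdf.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "http://www.site.com"   # placeholder: the site you want to crawl

def crawl_for_pdfs(start_url, max_pages=500):
    """Breadth-first crawl of one host; returns the PDF URLs it finds."""
    host = urlparse(start_url).netloc
    seen, queue, pdfs = {start_url}, deque([start_url]), []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                  # skip pages that fail to load
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue                                  # only parse HTML pages for links
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc != host or link in seen:
                continue                              # stay on-site, visit each URL once
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdfs.append(link)                     # collect it (or download it here)
            else:
                queue.append(link)                    # keep crawling HTML pages
    return pdfs

if __name__ == "__main__":
    for pdf_url in crawl_for_pdfs(START_URL):
        print(pdf_url)

Once you have the list, you can download inside the loop or feed the printed URLs to wget -i instead.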

Alistra

If there are no links to PDF files, a crawler won't help and you basically only have two choices:

  1. Get the list from somewhere else (ask the site's webmaster for one).
  2. Get the list from the site's directory listing (see the sketch below). If they have disabled directory indexing on their web server, though, you won't be able to use it.
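
If directory indexing does happen to be enabled, pulling the PDF links out of the index page is straightforward. A minimal sketch, assuming the same requests and beautifulsoup4 packages and using http://www.site.com/files/ as a placeholder for a directory that serves a plain Apache/nginx-style listing:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LISTING_URL = "http://www.site.com/files/"   # placeholder: a directory with indexing enabled

resp = requests.get(LISTING_URL, timeout=10)
resp.raise_for_status()                      # fails here if indexing is disabled (403/404)

for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
    if a["href"].lower().endswith(".pdf"):
        print(urljoin(LISTING_URL, a["href"]))   # or download it instead of printing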