
I want to use Wget to save single web pages (not recursively, not whole sites) for reference, much like Firefox's "Web Page, complete" option.

My first problem is: I can't get Wget to save the background images specified in the CSS. Even if it did save the background image files, I don't think --convert-links would convert the background-image URLs in the CSS file to point to the locally saved background images. Firefox has the same problem.

My second problem is: if there are images on the page I want to save that are hosted on another server (like ads), these won't be included. --span-hosts doesn't seem to solve that problem with the command below.

I'm using:

wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -e robots=off http://domain.tld/webpage.html

user14124

4 Answers


From the Wget man page:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to '-p':

wget -E -H -k -K -p http://www.example.com/

Also, in case robots.txt is disallowing you, add -e robots=off.
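
For clarity, the same invocation spelled out with long option names (a sketch assuming a reasonably recent wget, reusing the man page's example URL):

wget --adjust-extension --span-hosts --convert-links --backup-converted --page-requisites -e robots=off http://www.example.com/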

vvo

The wget command offers the option --mirror, which does the same thing as:

$ wget -r -N -l inf --no-remove-listing

You can also throw in -x to create a whole directory hierarchy for the site, including the hostname.

You might not have been able to find this option if you aren't using the newest version of wget, however.
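
For example, a sketch combining the two suggestions above (example.com is just a placeholder for the target site):

$ wget --mirror -x http://www.example.com/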

k0pernikus

I made Webtography for a similar purpose: https://webjay.github.io/webtography/

It uses Wget and pushes the site to a repository on your GitHub account.

I use these arguments:

--user-agent=Webtography
--no-cookies
--timestamping
--recursive
--level=1
--convert-links
--no-parent
--page-requisites
--adjust-extension
--max-redirect=0
--exclude-directories=blog
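
Assembled into a single command line (with http://www.example.com/ standing in for the target site), that is roughly:

wget --user-agent=Webtography --no-cookies --timestamping --recursive --level=1 --convert-links --no-parent --page-requisites --adjust-extension --max-redirect=0 --exclude-directories=blog http://www.example.com/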

https://github.com/webjay/webtography/blob/master/lib/wget.js#L15-L26

webjay

It sounds like wget and Firefox are not parsing the CSS for links to include those files in the download. You could work around those limitations by wget'ing what you can and scripting the link extraction from any CSS or JavaScript in the downloaded files, to generate a list of files you missed. Then a second run of wget on that list of links could grab whatever was missed (use the -i flag to specify a file listing URLs).
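
A rough shell sketch of that extraction step, assuming the CSS files from the first run sit in the current directory and that the url() values are absolute URLs (relative ones would still need to be resolved against the page's base URL):

grep -hoE 'url\([^)]*\)' *.css | sed -e 's/^url(//' -e 's/)$//' -e "s/[\"']//g" | sort -u > css-urls.txt
wget -i css-urls.txt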

If you like Perl, there's a CSS::Parser module on CPAN that may give you an easy means to extract links in this fashion.

Note that wget only parses certain HTML markup (href/src) and CSS URIs (url()) to determine which page requisites to get. You might try using Firefox add-ons like DOM Inspector or Firebug to figure out whether the third-party images you aren't getting are being added through JavaScript -- if so, you'll need to resort to a script or Firefox plugin to get them too.

quack quixote