To make the explanation easy and a bit entertaining, let's imagine I want to download the Wikipedia pages of all the people mentioned here with one wget command, possibly along with a reasonable number of other pages I am not interested in. Please do not close this question. If you think it's trivial, try to do it.
1 Answer
C:\blah>wget -r -l 1 -w 1 -t 1 -T 5 -nd -k -e "robots=off" http://en.wikipedia.org/wiki/List_of_inventors_killed_by_their_own_inventions
I can't test this quickly, because it takes a while to complete: it downloads roughly one link per second, and if it ran faster Wikipedia might block you. Also, -k does its link conversion after the download finishes, so it won't run if you hit Ctrl-C in the middle; you could let it run its course, or remove -k and -nd, stop it partway through, and see how it goes.
-r -l 1 <--- very crucial; this is exactly what your title asks for: follow the links on that page and download them, one level deep. (That includes links to other paths on the same host, but if you also wanted links pointing to foreign hosts you'd need -H too, as sketched below.)
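For example, if you did want wget to follow links that leave en.wikipedia.org, a variant like the one below should do it; -H allows spanning hosts and -D keeps the spidering limited to the domains you list (the second domain here is just an illustration, pick whatever you actually need):

C:\blah>wget -r -l 1 -H -D en.wikipedia.org,commons.wikimedia.org -w 1 -t 1 -T 5 -nd -k -e "robots=off" http://en.wikipedia.org/wiki/List_of_inventors_killed_by_their_own_inventions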
-w 1 -t 1 -T 5 <-- -w 1 waits 1 second between each HTTP request; otherwise the Wikipedia server may get annoyed and block you, since they don't really want anybody spidering their site. -t 1 retries a failed link only once. -T 5 is how long to wait before giving up on a link; if it hits a dead link you don't want it waiting 20 seconds and retrying 20 times, or the whole thing will take far longer than it should. -w 1 is the most important of these, since you don't want any kind of temporary block for bogging their server down.
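If the short options are hard to remember, the same politeness settings can be spelled out with long option names; this is just the same command written more verbosely, plus --random-wait as an optional extra that varies the delay a bit:

C:\blah>wget -r -l 1 --wait=1 --random-wait --tries=1 --timeout=5 -nd -k -e "robots=off" http://en.wikipedia.org/wiki/List_of_inventors_killed_by_their_own_inventions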
-e "robots=off" <--- this is crucial otherwise it won't work. This gets past wikipedia trying to stop spiders.
-nd <-- not so necessary; it just collapses the directory structure so all the files land in one directory. That may or may not be what you want, so you might choose to leave it out.
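If you leave -nd out but still want things tidy, -P (--directory-prefix) puts whatever directory tree wget creates under a folder of your choice; the folder name here is just an example:

C:\blah>wget -r -l 1 -w 1 -t 1 -T 5 -k -e "robots=off" -P inventors http://en.wikipedia.org/wiki/List_of_inventors_killed_by_their_own_inventions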
-k <-- convert links, so instead of pointing to the web pages online they point to the local files you downloaded. The catch is that wget apparently only does this conversion after the whole download finishes, which is why I can't just download a bit and really test it. You could also do the conversion manually with search and replace on your index page, List_of_inventors_killed_by_their_own_inventions.htm,
so anything that says /wiki/James_Douglas,_4th_Earl_of_Morton you could change by hand (there's a sketch of that below), though that part is probably fine. You could also leave out -nd, so you get all those files in a "wiki" subdirectory, or just move the files into a wiki subdirectory afterwards if need be. Either make your directory tree match the links or make the links match your directory tree.
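As a sketch of the manual conversion, assuming you used -nd (so everything sits in one flat directory) and your copy of the list page is saved as List_of_inventors_killed_by_their_own_inventions.htm (adjust the filename to whatever wget actually produced), a one-liner with GNU sed, or any batch search-and-replace tool, can rewrite the /wiki/ links to point at the local files:

sed -i 's|href="/wiki/|href="|g' List_of_inventors_killed_by_their_own_inventions.htm

That turns href="/wiki/James_Douglas,_4th_Earl_of_Morton" into href="James_Douglas,_4th_Earl_of_Morton", which matches the flat filenames -nd gives you.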
I'm a bit bumbling when downloading websites; I run into issues. Sometimes I use EditPad Pro and PowerGREP to make changes to the HTML with regular expressions, converting things myself. It's fiddly, and those programs aren't free, though others are. Before that I'd use Notepad search and replace on individual files, or some free program that can do search and replace across a batch of files, and sometimes MS Word, cutting blocks with alt-drag and editing the HTML by hand if need be. Fiddly. But that wget line should get you some of the way there.
Sometimes I grep all the links out of a page first, so I just have a file of links, and then do wget -i fileoflinks; with that method there's no funny business with recursion. I'd still add -w 1 -t 1 -T 2 or something like that so it doesn't bog the server down.
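As a rough sketch of that approach, assuming a Unix-style shell (Cygwin or Git Bash on Windows) and that you've already saved the list page locally under the filename used below (both filenames are just examples), something like this pulls out the /wiki/ links, turns them into full URLs, and feeds them to wget:

grep -o 'href="/wiki/[^"]*"' List_of_inventors_killed_by_their_own_inventions.htm | sed 's|href="|http://en.wikipedia.org|; s|"$||' | sort -u > fileoflinks
wget -i fileoflinks -w 1 -t 1 -T 2 -nd

You'd probably still want to weed the non-article links (Help:, Category:, and so on) out of fileoflinks before running the second command.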