189

I want to get all the files for a given website at the Internet Archive’s Wayback Machine. Reasons might include:

  • The original author did not archive their own website and it is now offline; I want to make a public cache of it.
  • I am the original author of a website and lost some content; I want to recover it.

How do I do that?

Take into consideration that the Internet Archive’s Wayback Machine has its own special quirks: webpage links do not point to the archive itself, but to the original web pages, which might no longer be there. JavaScript is used client-side to update the links, so a trick like a recursive wget won't work.

user36520

9 Answers

137

I tried different ways to download a site, and finally I found the Wayback Machine Downloader, which was built by Hartator (so all credit goes to him), but I simply had not noticed his comment on the question. To save you time, I decided to add the wayback_machine_downloader gem as a separate answer here.

The site at http://www.archiveteam.org/index.php?title=Restoring lists these ways to download from archive.org:

  • Wayback Machine Downloader, a small tool in Ruby to download any website from the Wayback Machine. Free and open source. My choice! As of 2025, the original project had not received updates for three years and had stopped working for at least some sites; StrawberryMaster/wayback-machine-downloader is a fork that worked better (a sketch of running it from a checkout follows this list).
  • Warrick - Main site seems down.
  • Wayback downloaders - a service that will download your site from the Wayback Machine and even add a plugin for WordPress. Not free.
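For completeness, here is roughly how the StrawberryMaster fork can be run from a git checkout. This is only a sketch and assumes the fork keeps the original project's layout (a Gemfile plus a bin/wayback_machine_downloader executable); check the fork's README for its current instructions.

git clone https://github.com/StrawberryMaster/wayback-machine-downloader.git
cd wayback-machine-downloader
bundle install                                    # install the gem's Ruby dependencies
ruby -Ilib bin/wayback_machine_downloader http://example.com   # run the downloader against the target site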
33

This can be done using a bash shell script combined with wget.

The idea is to use some of the URL features of the wayback machine:

  • http://web.archive.org/web/*/http://domain/* will list all saved pages from http://domain/ recursively. It can be used to construct an index of pages to download, avoiding heuristics to detect links within web pages. For each link, the dates of the first and the last archived versions are also given.
  • http://web.archive.org/web/YYYYMMDDhhmmss*/http://domain/page will list all versions of http://domain/page for the year YYYY. Within that page, links to specific versions (with exact timestamps) can be found.
  • http://web.archive.org/web/YYYYMMDDhhmmssid_/http://domain/page will return the unmodified page http://domain/page at the given timestamp. Notice the id_ token.

These are the basics to build a script to download everything from a given domain.
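As an illustration, here is a minimal bash sketch along those lines. Instead of scraping the HTML listing at http://web.archive.org/web/*/http://domain/*, it builds the index via the Wayback CDX API (a machine-readable equivalent of that listing), then fetches each page unmodified using the id_ token. example.com is a placeholder, and error handling is deliberately minimal.

#!/bin/bash
# Sketch: grab one archived copy of every URL captured under a domain.
DOMAIN="example.com"

# 1. Build an index: one "timestamp original-url" line per unique URL (its earliest capture).
curl -s "http://web.archive.org/cdx/search/cdx?url=${DOMAIN}/*&fl=timestamp,original&filter=statuscode:200&collapse=urlkey" > index.txt

# 2. Fetch each capture unmodified via the id_ token and mirror it locally.
while read -r ts url; do
  out="${url#*://}"                                # strip the scheme to form a local path
  [ -z "${out##*/}" ] && out="${out}index.html"    # directory URLs become index.html
  mkdir -p "$(dirname "$out")"
  curl -s -o "$out" "http://web.archive.org/web/${ts}id_/${url}"
  sleep 1                                          # be gentle with archive.org
done < index.txt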

user36520
21

You can do this easily with wget.

wget -rc --accept-regex '.*ROOT.*' START

Where ROOT is the root URL of the website and START is the starting URL. For example:

wget -rc --accept-regex '.*http://www.math.niu.edu/~rusin/known-math/.*' http://web.archive.org/web/20150415082949fw_/http://www.math.niu.edu/~rusin/known-math/

Note that you should bypass the Wayback Machine's wrapping frame for the START URL. In most browsers, you can right-click on the page and select "Show Only This Frame".

jcoffland
6

The hartator version of wayback-machine-downloader no longer works, no longer gets updates, and has open issues. Luckily, someone took over: https://github.com/ShiftaDeband/wayback-machine-downloader (give it a star if that version works for you; it does for me!)

5

Wayback Machine Downloader works great. It grabbed the pages (.html, .js, .css) and all the assets. I simply ran the index.html locally.

With Ruby installed, simply do:

gem install wayback_machine_downloader
wayback_machine_downloader http://example.com -c 5 # -c 5 adds concurrency for much faster downloads

If your connection fails halfway through a large download, you can even run it again and it will re-grab any missing pages.
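If you only need part of a site or a particular time range, the gem also takes filtering options. The flags below are from the gem's --help as I recall it (verify against your installed version): -d sets the output directory, -f and -t limit the snapshot timestamps, and --only restricts which paths are downloaded.

wayback_machine_downloader http://example.com -d ./example-backup -f 20150101000000 -t 20151231235959 --only "/blog/"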

5

There is a tool specifically designed for this purpose, Warrick: https://code.google.com/p/warrick/

It's based on the Memento protocol.
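For context, Memento is an HTTP-based protocol for addressing archived versions of a resource, and the Wayback Machine exposes Memento endpoints of its own. For example, a TimeMap listing the known captures of a URL can be fetched like this (endpoint as documented by the Internet Archive; treat it as illustrative):

curl -s "http://web.archive.org/web/timemap/link/http://example.com/"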

Nemo
3

Interestingly enough, the official answer is to use a third-party service, and they presently list a few services:

https://waybackrebuilder.com
http://waybackdownloader.com
http://www.waybackmachinedownloader.com/en/
https://www.waybackdownloads.com
akostadinov
1

I was able to do this using Windows PowerShell.

  • Go to the Wayback Machine and type in your domain.
  • Click URLs.
  • Copy/paste all the URLs into a text file (e.g. in VS Code). You might have to repeat this, because Wayback only shows 50 at a time.
  • Using search and replace in VS Code, change all the lines to look like this:
Invoke-RestMethod -uri "https://web.archive.org/web/20200918112956id_/http://example.com/images/foobar.jpg" -outfile "images/foobar.jpg"
  • Using a regex search/replace is helpful; for instance, change the pattern example.com/(.*) to example.com/$1" -outfile "$1"

The number 20200918112956 is a timestamp (YYYYMMDDhhmmss). It doesn't matter very much what you put here, because the Wayback Machine will automatically redirect to a valid capture.

  • Save the text file as GETIT.ps1 in a directory such as c:\stuff.
  • Create all the directories you need, such as c:\stuff\images.
  • Open PowerShell, cd to c:\stuff, and execute the script.
  • You might need to relax PowerShell's script execution policy first; see link.
John Henckel
1

Today I found this online service: https://archivarix.com


Video about it https://archive.org/details/HowToUseWaybackMachine

Not free.

Vitaly Zdanevich