
I am trying to archive a website that will soon vanish. I tried wget and httrack. The problem is that the site intermittently returns PHP errors (database connection errors), and the downloaded page is then worthless. However, the HTTP status is still 200, so wget considers the download successful. The error string is predictable and easy to match.

Is there a way to tell wget or httrack to re-download a page if the response contains a particular string or expression? Are there better web archiving tools in 2024?

filo
  • 223

1 Answer


The tool to use is ArchiveTeam's wget-lua, a fork of wget with Lua hooks. A Lua hook can inspect the full contents of a downloaded file and instruct wget to download it again when a pattern matches.

https://github.com/ArchiveTeam/wget-lua
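A minimal sketch of such a hook, assuming the broken pages contain the literal string "database connection error" (substitute your site's actual error text). The callback and action names (`wget.callbacks.httploop_result`, `wget.actions.CONTINUE`, `http_stat["local_file"]`) follow the wget-lua fork's API as used in ArchiveTeam grab scripts; check the repository before relying on the exact signatures.

```lua
-- retry.lua: re-fetch pages whose body matches a known error string.
-- Run with: wget --lua-script=retry.lua <url>

-- Helper (not part of the wget-lua API): read a downloaded file into a string.
local function read_file(path)
  local f = io.open(path, "rb")
  if not f then return "" end
  local data = f:read("*all")
  f:close()
  return data
end

wget.callbacks.httploop_result = function(url, err, http_stat)
  -- http_stat["local_file"] points at the file wget just wrote.
  local body = read_file(http_stat["local_file"])
  -- Plain-text match (the 'true' flag disables Lua patterns).
  if string.find(body, "database connection error", 1, true) then
    os.execute("sleep 10")        -- back off so the server can recover
    return wget.actions.CONTINUE  -- tell wget to retry this URL
  end
  return wget.actions.NOTHING     -- accept the response as-is
end
```

The backoff sleep is optional, but useful here since the errors suggest the server is overloaded; hammering it with immediate retries would only make the failures more frequent.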
