
I am trying to archive my classified ads from a classifieds portal immediately after I have sold the respective item and before I delete the ad.

I know that I can't do that with the means browsers provide by default, unless I forgo interactive elements and a lot of other things, which is not an option. My research suggested that archiving the pages in WARC format (whether compressed or not) would be a proper way to achieve the goal. I also found that wget can archive web pages in WARC format.

So I studied the wget manual and saw that I would need the -p command-line switch to download all requisites belonging to the respective page. The complete explanation of -p in the manual is a bit lengthy and would clutter this post too much, but the key sentence is probably the first one:

This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

So it seemed that I really needed that switch. Therefore I tried:

wget -p <URL of small ad> --warc-file test

This command generated a lot of files in a lot of subdirectories:

-rw-r--r--+ 1           None 1927701 2024-07-14 10:59 test.warc.gz
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 cdnjs.cloudflare.com
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 img.kleinanzeigen.de
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 static.kleinanzeigen.de
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 www.facebook.com
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 www.google.com
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 www.googletagmanager.com
drwxr-xr-x+ 1           None     0 2024-07-14 10:21 www.kleinanzeigen.de

Here we see the WARC file itself (test.warc.gz) and the subdirectories that wget has created. To avoid misunderstandings, I'd like to emphasize that this directory structure is not the unpacked WARC file (actually, I have no clue how to unpack a warc.gz file). That is, wget created the directory structure independently of creating the WARC file.
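As far as I understand, a .warc.gz is ordinary gzip data (typically one gzip member per WARC record, concatenated), so it can at least be inspected with standard tools. A minimal, self-contained sketch in Python's stdlib; the file name demo.warc.gz and the record contents below are invented for illustration, but a real test.warc.gz from wget has the same overall layout:

```python
import gzip

# Build a minimal one-record WARC in memory so this sketch is
# self-contained; the record contents are made up for illustration.
payload = "HTTP/1.1 200 OK\r\n\r\nhello"
record = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://www.kleinanzeigen.de/s-anzeige/example\r\n"
    f"Content-Length: {len(payload)}\r\n"
    "\r\n"
    f"{payload}\r\n\r\n"
).encode()

# wget writes one gzip member per record; concatenated gzip members
# still form a single valid gzip stream.
with open("demo.warc.gz", "wb") as f:
    f.write(gzip.compress(record))
    f.write(gzip.compress(record))

# gzip.open reads across member boundaries transparently, so the
# records can be listed like plain text.
found = []
with gzip.open("demo.warc.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if line.startswith(("WARC/", "WARC-Target-URI:")):
            found.append(line.rstrip())
print("\n".join(found))
```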

When trying to find out how to use the local copy of the page in question, I noticed the following two facts:

  1. test.warc.gz does not contain the archived page including all requisites. I loaded it into webrecorder and examined its contents. While it did contain a lot of stuff, it didn't even contain the main HTML file of the archived page. So it is totally useless, at least on its own.

  2. In contrast, the subdirectories that wget created seem to contain the complete page including all requisites. Once I had spotted the main HTML file of the archived page in these subdirectories, I could open it in the browser, and everything was there: images, interactive elements, links and so on. Even after I had deleted test.warc.gz, the local copy of the page remained fully operational (at least seemingly; I haven't yet tested what happens when I clear the browser cache and unplug the network cable).

Both observations lead to the question of what a WARC archive is good for at all:

On the one hand, the WARC file on its own is useless because it doesn't even contain the main HTML file, and probably doesn't contain most of the other important files either. On the other hand, the subdirectories that wget created seem to contain the complete page on their own, including requisites, and that copy seems to work flawlessly even after the WARC file has been deleted.
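To check the first observation more systematically than by eyeballing webrecorder, one could scan the WARC's records directly. As far as I understand the WARC format, each record starts with a "WARC/..." version line, followed by header lines (including WARC-Type and WARC-Target-URI) up to a blank line, then a payload of Content-Length bytes. A rough sketch along those lines (the helper name and the commented usage path are mine, not from any tool):

```python
import gzip

def warc_response_uris(path):
    """Yield the WARC-Target-URI of every 'response' record in a .warc.gz.

    Sketch based on my understanding of the WARC format: a record is a
    'WARC/...' version line, header lines up to a blank line, then a
    payload of Content-Length bytes, then a blank line.
    """
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break                    # end of file
            if not line.startswith(b"WARC/"):
                continue                 # blank separator lines between records
            headers = {}
            for raw in iter(f.readline, b"\r\n"):
                name, _, value = raw.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Skip the payload so the next readline() lands on the next record.
            f.read(int(headers.get("content-length", "0")))
            if headers.get("warc-type") == "response":
                yield headers.get("warc-target-uri")

# Usage (assuming the test.warc.gz produced by the wget run above):
# for uri in warc_response_uris("test.warc.gz"):
#     print(uri)
```

If the archived ad were complete, the URL of the main HTML page should show up among the yielded URIs.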

So what exactly is the point of the WARC file? What is it good for?

From my research, my understanding is that a WARC file is meant to contain the archived page, including everything that could be needed to display it and interact with it. This was clearly not the case in my tests with wget.
