
I am archiving a number of web sites in order to retain many of the files linked from them, specifically a number of PDFs.

I haven't had a problem using the Heritrix crawler to collect the sites. However, I haven't found a good solution for extracting individual files from the resulting .warc files.

Does anyone have experience with this, or have a preferred way to get these individual files out?

wxs

7 Answers

7

You could browse the WARC with Webarchive Player and save the files you want from your browser. Alternatively, upload the WARC to webrecorder.io and browse/download there.

7

ReplayWeb.page replaces Webrecorder Player, which in turn replaced WebArchivePlayer.

There is no app to install; just go to the page and browse to your file. All processing is local.

Andrew Olney
5

I suggest trying warctools (https://github.com/internetarchive/warctools); it's a Python library that is very easy to use.
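
Since the question is specifically about pulling PDFs out of a crawl, a record-by-record loop along the following lines should do it. This is only a rough sketch: crawl.warc.gz and the output directory are placeholders, and the warctools calls used here (WarcRecord.open_archive, record.type, record.url, record.content) are based on my reading of the library and may differ between versions.

import os

from hanzo.warctools import WarcRecord

OUTPUT_DIR = "extracted_pdfs"  # placeholder output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# open_archive should handle both plain and gzipped WARCs
fh = WarcRecord.open_archive("crawl.warc.gz", gzip="auto")
for record in fh:
    # Only response records carry the downloaded payload.
    if record.type != WarcRecord.RESPONSE or not record.url:
        continue
    # warctools generally exposes URLs as bytes; adjust if yours are str.
    if not record.url.lower().endswith(b".pdf"):
        continue
    content_type, body = record.content
    # The payload is wrapped in the original HTTP response, so strip the
    # HTTP headers up to the first blank line.
    payload = body.split(b"\r\n\r\n", 1)[-1]
    name = os.path.basename(record.url.decode("ascii", "replace")) or "unnamed.pdf"
    with open(os.path.join(OUTPUT_DIR, name), "wb") as out:
        out.write(payload)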

4

I've found that 7-Zip by itself often doesn't work, but there is a plugin for it called eDecoder that adds WARC support.

eDecoder can be downloaded for free from here.

With this plugin installed, opening a warc works like opening any other archive in 7-Zip, with a few exceptions:

  • an extra column is added that shows the original URL of each file.
  • each file name gets a number prefix to prevent filename collisions (e.g. index.html could become 000123 index.html).
  • folder structures are discarded: every file appears in the main view regardless of the folder it was originally in, and there are no folders at all.

While it can be downloaded for free, it appears to be closed source (neither the code nor a license is published), and because it is a compiled DLL it is limited to Windows.

3

I am using this project: https://github.com/chfoo/warcat

Example Run:

python3 -m warcat --help
python3 -m warcat list example/at.warc.gz
python3 -m warcat verify megawarc.warc.gz --progress
python3 -m warcat extract megawarc.warc.gz --output-dir /tmp/megawarc/ --progress
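
Since the original question was about PDFs in particular, once warcat has extracted everything you can collect just the PDFs with a few lines of Python. This is a small sketch: it assumes the extracted files keep their .pdf extension and reuses the /tmp/megawarc/ output directory from the commands above.

from pathlib import Path

# Walk the directory tree that warcat produced and gather every PDF.
pdfs = sorted(Path("/tmp/megawarc/").rglob("*.pdf"))
for pdf in pdfs:
    print(pdf)
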
xLight
1

I was looking for a solution suitable for terminal use (Ubuntu). Unfortunately, in my case (WARC files created with browsertrix-crawler) the previous answers did not work out.

I found warc-extractor to work best in my case. It is a Python tool, and extracting all HTML pages is as easy as calling:

$ warc-extractor http:content-type:text/html -dump content -error

in the directory containing the WARC files. I needed the -error flag because my crawls contain quite a few problematic pages. For my use case it is enough to successfully extract the bulk of the content, which this tool does well.

Michael
1

I've used 7-Zip before to extract individual files or whole archives from Web Archive format files.

It's available from their site here.

Martin