I am seeking a tool to space-efficiently archive a blog that is changing every day or even two or three times a day. I don't mean that individual blog posts change - not regularly anyway - I just mean that new blog entries are added and older entries are shifted down the front page. One problem I see is that it will be inefficient to archive the same blog entry multiple times. Revisions to the same entry should be archived, ideally, but the original need not be since the revision is likely due to an improvement or correction.

It is a blogspot.com blog with text and static images. A Linux solution is preferred.

H2ONaCl

1 Answer

One solution is to store it in a Git repository.

Since Git uses content-based addressing, unchanged files take up negligible additional space in the repository. Revisions also take up little space once packed, because packfiles store similar objects as deltas against one another. Initially each blob is stored individually compressed, but Git periodically combines objects into packs, which compress more effectively. You can also trigger this manually with git gc.
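To see why an unchanged page costs almost nothing, it helps to look at how a blob's ID is derived. The sketch below assumes a repository using Git's default SHA-1 object format; git_blob_id is a name made up for this illustration, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a blob (assumes the SHA-1 object format)."""
    # Git hashes a small header ("blob", the byte length, a NUL) plus the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two downloads of the same page produce the same ID, so Git stores one blob.
first_download = b"test content\n"
second_download = b"test content\n"
print(git_blob_id(first_download) == git_blob_id(second_download))
```

The ID depends only on the file's bytes, not on its name or on when it was fetched, so a post that has not changed between two snapshots maps to a blob that is already in the repository.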

A simple way to fetch the website data is to use wget --mirror. Alternatively, check whether the blog site provides an XML API (which would be more space-efficient, since it avoids archiving boilerplate HTML). Either way, download the pages into the repository's working tree.

Then, after the download finishes, add and commit everything to the Git repository. Each commit then represents a snapshot in time.
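The download-then-commit cycle could be scripted roughly as follows. This is only a sketch: it assumes wget and git are installed, that repo_dir has already been set up with git init and a configured committer identity, and that the extra wget flags (--page-requisites, --no-host-directories) suit your blog. take_snapshot and its run parameter are names invented for this example.

```python
import subprocess
from datetime import datetime, timezone


def take_snapshot(url, repo_dir, run=subprocess.run):
    """Mirror url into repo_dir and commit the result. Returns True if a commit was made."""
    # wget exits non-zero when any single link fails, so its status is not checked.
    run(["wget", "--mirror", "--page-requisites", "--no-host-directories", url],
        cwd=repo_dir)
    # Stage new, changed, and deleted files.
    run(["git", "add", "--all"], cwd=repo_dir, check=True)
    # Skip the commit when the mirror produced no changes.
    status = run(["git", "status", "--porcelain"], cwd=repo_dir,
                 check=True, capture_output=True, text=True)
    if not status.stdout.strip():
        return False
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    run(["git", "commit", "--quiet", "--message", f"Snapshot {stamp}"],
        cwd=repo_dir, check=True)
    return True
```

Calling take_snapshot with the blog's URL and the repository path from a cron job two or three times a day would give one commit per changed snapshot; running git gc occasionally keeps the repository packed.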

Mechanical snail