
Background

I have backups of a website that stores all of its data in a single file. This file is several gigabytes in size, and I have many different backups of it. Most of the data in each backup is the same, plus whatever was added or changed since the last one.

I want to keep all the backups I've made through the years in case I discover data corruption somewhere along the line. However, storing a 10 GB file every month gets expensive.

Seeking Solution

I've often thought about different ways of alleviating this problem. One idea that comes up very often is a deduplicating file system that doesn't require its own partitioned volume on a hard drive. Something like what TrueCrypt does with what it calls "file-hosted containers": the TrueCrypt program lets you mount and dismount such a volume as if it were a regular hard drive.

Question

Is there a virtual hard drive mounter that uses a file-based container with a deduplicating file system?

(This question is a little awkward to put into words; if you have a better idea of how to ask it, please feel free to help out.)

An Dorfer
  • 1,178
Mallow
  • 339

3 Answers

2

Use the ZFS or Btrfs file systems, or OpenDEDUP.

I should also note that you can create "disks" in files on Linux and mount them through the loopback device (mount -o loop ...), which makes them virtual.
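As a rough sketch of that approach (the file name, size, mount point, and ext4 as the file system are just examples):

truncate -s 10G /storage/container.img       # create a sparse backing file
mkfs.ext4 -F /storage/container.img          # format the file as if it were a disk
sudo mkdir -p /mnt/container
sudo mount -o loop /storage/container.img /mnt/container   # mount as a virtual drive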

You may be better off just loopback-mounting a ZFS-formatted file, since ZFS is pretty much the de facto standard when it comes to deduplication. If you don't know how to do this, see here.
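For illustration, ZFS can actually use a regular file directly as a pool device, so no separate loop step is needed. A minimal sketch, assuming ZFS on Linux is installed (the pool name and paths are hypothetical, and note that ZFS deduplication needs a lot of RAM):

truncate -s 20G /storage/zfs-container.img        # sparse backing file for the pool
sudo zpool create backuppool /storage/zfs-container.img
sudo zfs set dedup=on backuppool                  # enable block-level deduplication
sudo zfs set compression=on backuppool            # optional, but usually helps too
# the pool is mounted at /backuppool by default; copy backups there
sudo zpool export backuppool                      # "dismount" when done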

1

While it doesn't help for the data you have so far, you really ought to be looking at something like rsnapshot, or even simply rsync, to make incremental backups. While deduplication is very shiny and awesome, having to read every block and compare it before deduplicating similar files is heavy. Doing incremental backups at backup time makes much more sense.
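As an illustration of the rsync approach (a minimal sketch; the paths and dates are made-up examples, and note that hard-link snapshots deduplicate at the file level, not the block level):

rsync -a /data/site/ /backups/2013-01/                                 # first full copy
rsync -a --link-dest=/backups/2013-01 /data/site/ /backups/2013-02/    # unchanged files become hard links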

Journeyman Geek
  • 133,878
1

A binary patch utility can produce a patch file which is quite small if most of the two files are the same. You can pick pairs of files, generate a patch, delete the target, and just save the source file plus the patch file.

I have used xdelta for this purpose.

xdelta delta JanFile FebFile JanToFebPatch   # diff January's file against February's

xdelta delta JanFile MarFile JanToMarPatch   # diff January's file against March's

This works well if you keep one full backup plus several incremental patches based on it. xdelta has various options to speed things up or reduce memory usage.
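To get a later backup back, apply the patch to the retained full file (a sketch using xdelta 1.x syntax; file names match the example above):

xdelta patch JanToFebPatch JanFile FebFile   # reconstructs FebFile from JanFile plus the patch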

Brian
  • 9,034