
I have several directories containing thousands of gzip files (overall we are talking about 1M files). Some of these files are corrupted, and most of them are really small (a couple of KB each).

Almost all of them are highly similar in content, so compressing them all together should improve the compression ratio with respect to the current situation.

Since I rarely browse these directories and just need to keep them around for archival reasons, I need a highly available and highly compressible format in which to create a single archive. It would be nice to have random access, so that I can extract a specific file once in a while without decompressing the whole archive.

What's the best strategy here? Is tar resilient to corruption? I'd prefer something that can be implemented as a one-liner or a simple bash script.

nopper

2 Answers


After researching this, the way I would solve the problem is to uncompress all the files, create a list of their sha256 sums (or whatever hash you prefer), then compress all the files together into a single archive. I'd be inclined to use a tar.gz file for speed and ease of use, but you could use zip, bzip2, 7-Zip, xz or something else if you want a smaller archive. Compressing all the files into a single large one will save quite a lot of space in its own right.
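As a minimal sketch of that workflow in bash, assuming the existing .gz files live under ./archive-dirs, that ./uncompressed is free to use as a scratch area, and that combined.tar.gz is an acceptable output name (all three are placeholders), and additionally assuming you are happy to skip files that fail gzip's CRC check:

```bash
#!/usr/bin/env bash
# Sketch only: directory and archive names are placeholders, adjust before use.
set -euo pipefail

SRC=./archive-dirs        # where the existing .gz files live (assumption)
WORK=./uncompressed       # scratch area for decompressed copies (assumption)

mkdir -p "$WORK"

# 1. Decompress every .gz file, keeping the relative path; skip files
#    that fail gzip's CRC check (the corrupted ones).
find "$SRC" -name '*.gz' -print0 | while IFS= read -r -d '' f; do
    out="$WORK/${f#"$SRC"/}"
    out="${out%.gz}"
    mkdir -p "$(dirname "$out")"
    gunzip -c "$f" > "$out" 2>/dev/null || { echo "corrupt, skipped: $f" >&2; rm -f "$out"; }
done

# 2. Record a checksum for every decompressed file.
( cd "$WORK" && find . -type f -print0 | xargs -0 sha256sum ) > checksums.sha256

# 3. Compress everything, plus the checksum list, into a single archive (GNU tar).
tar -czf combined.tar.gz checksums.sha256 -C "$WORK" .
```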

When that is done, use 'par2' to create redundancy and verification data for the compressed file, and back up the archive along with the .par2 files. (I've not played with it much, but the purpose of par2 is to create recovery data with redundancy (PARity) that bolsters the integrity of the files.)
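As a rough illustration with par2cmdline (the 10% redundancy figure is an arbitrary example, and combined.tar.gz is the placeholder archive name from the sketch above):

```bash
# Create PAR2 recovery files with ~10% redundancy next to the archive.
par2 create -r10 combined.tar.gz

# Later, verify the archive against the recovery data...
par2 verify combined.tar.gz.par2

# ...and attempt a repair if verification reports damage.
par2 repair combined.tar.gz.par2
```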

davidgo

Unfortunately, there is no definitive answer to a question like this. Different compression programs and algorithms will achieve different compression ratios depending on the data. There is no way to know in advance how good the compression will be; if there were, don't you think it would be built into all the compression programs?

You say there are thousands of files, which adds up to a number of gigabytes. Let's say you have 5000 files of 1 MB each, that's 5 GB of data, and that zipping on ultra gets you down to 2 GB. If you try another program and algorithm that is 5% better (I would think that's a high estimate), that only saves you about 100 MB. Not much in the grand scheme of things.
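If you want real numbers rather than guesses, the simplest thing is to measure on a representative sample of your own data. A throwaway comparison, assuming gzip, xz and zstd are installed and that sample.tar is a tarball you have built from a subset of the files (a placeholder name), might look like this:

```bash
# Compare compressed sizes of the same sample tarball across tools.
for cmd in "gzip -9" "xz -9" "zstd -19"; do
    size=$($cmd -c sample.tar | wc -c)
    printf '%-10s %12s bytes\n' "$cmd" "$size"
done
```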

As for resilience to corruption, there is no such thing. It's possible that one compression program might handle corruption, such as a failed CRC check, better than another. At best, that might mean only some of your data is lost rather than all of it. But again, there is really no way to know. Simply put, there is no replacement for backups of important data.
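One concrete thing you can do before archiving is find out which of the existing gzip files already fail their CRC check, so you at least know what is damaged going in (the directory name is a placeholder):

```bash
# Print the gzip files whose built-in CRC check fails.
find ./archive-dirs -name '*.gz' -print0 | while IFS= read -r -d '' f; do
    gzip -t "$f" 2>/dev/null || echo "corrupt: $f"
done
```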

Keltari