I accidentally deleted a handful of gzipped files from a folder. Thankfully, I had uncompressed them in a different location, and am in the process of restoring them. I had the md5 checksums for the old (now deleted) files, but the checksums for the newly compressed files don't match. Crap.

But... I have another folder that contains similar gzipped files from the same source, and when I gunzip and then immediately gzip one of those files, the checksum is again different. That leads me to suspect that the originator of the files used different parameters for gzip (if there's an alternative explanation, I'd love to hear it).

Is there any way to identify the gzip parameters used so that I can verify that my manipulation hasn't messed up the contents of the files?

kevbonham
  • 133

2 Answers

All these utilities include some metadata that can change from run to run, so even with identical input you can get slightly different compressed files (and so a different MD5). To compare the contents, you have to decompress the files and compare the decompressed data.
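As a rough illustration of comparing contents rather than compressed bytes, here is a minimal Python sketch. The function names `md5_of_decompressed` and `same_contents` are made up for this example; it assumes both files are ordinary gzip files that Python's `gzip` module can read.

```python
import gzip
import hashlib


def md5_of_decompressed(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed contents of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if two .gz files decompress to identical data."""
    return md5_of_decompressed(path_a) == md5_of_decompressed(path_b)
```

Two files for which `same_contents` returns True hold the same data, even if the MD5 sums of the .gz files themselves differ.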

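If you look up gzip on Wikipedia, you learn that a gzip file starts with a 10-byte header containing a magic number (1f 8b), the compression method, flags, a timestamp, extra compression flags, and an OS identifier. The timestamp is normally the modification time of the original file, so recompressing a restored copy (whose modification time has usually changed) will typically give a different file, unless you preserve the timestamp or omit it with gzip -n.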
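To see what is actually stored in that fixed header, something like the following Python sketch could be used. The field layout follows RFC 1952; the function name `read_gzip_header` and the dictionary keys are invented for this example, and only a few of the OS codes are spelled out.

```python
import datetime
import struct

# A few of the OS codes defined in RFC 1952; others are reported by number.
OS_NAMES = {0: "FAT filesystem", 3: "Unix", 7: "Macintosh",
            11: "NTFS filesystem", 255: "unknown"}
# XFL only records the two extremes, not the exact compression level.
XFL_HINTS = {2: "max compression", 4: "fastest"}


def read_gzip_header(path):
    """Parse the fixed 10-byte gzip header of the file at `path`."""
    with open(path, "rb") as fh:
        raw = fh.read(10)
    if len(raw) < 10:
        raise ValueError("file is too short to be a gzip file")
    # Little-endian: magic (2 bytes), method, flags, mtime (4 bytes), XFL, OS.
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", raw)
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip file (bad magic number)")
    return {
        "method": method,                      # 8 means DEFLATE
        "has_name": bool(flags & 0x08),        # FNAME flag: original name stored
        "mtime": mtime,                        # 0 means no timestamp stored
        "mtime_utc": (datetime.datetime.fromtimestamp(mtime, datetime.timezone.utc)
                      if mtime else None),
        "level_hint": XFL_HINTS.get(xfl, "unspecified"),
        "os": OS_NAMES.get(os_id, f"other ({os_id})"),
    }
```

Running this on an original file from the other folder and on a recompressed copy shows which header fields differ between the two.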

xenoid
  • 10,597

The standard Unix file utility gives you some basic info about a .gz file, e.g.:

$ file foo.gz
foo.gz: gzip compressed data, was "foo", from Unix, last modified: Tue Aug  1 14:19:21 2017, max compression

As you can see, the header stores the original filename, the OS on which the compression was performed, the modification time, and a coarse compression-level hint (only "max compression" or "fastest" is recorded, not the exact level). Note that the stored filename might differ from the .gz name if you did something like gzip -c tempfile > foo.gz, in which case the original filename would be tempfile. The name might also be absent entirely if gzip never got an original filename because it read from a stream (e.g., tar czf foo.tar.gz somedir).

So you probably want to get an idea of what factors might be different first. I don't know how important all this really is to you, but you could look at RFC 1952, which gives the file format. You could try different settings, and even hex-edit some of the fields to match the originator's if needed (e.g., different OS).
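Trying different settings can be automated. The Python sketch below is only one possible approach: `gzip_cli` and `find_matching_level` are hypothetical helper names, and the code assumes a `gzip` command-line tool is on the PATH and that the original filename and modification time are known (for instance from the header of a sibling file). A match is not guaranteed, since a different gzip version or implementation on the originator's side can produce different bytes at the same level.

```python
import hashlib
import os
import subprocess
import tempfile


def gzip_cli(data, level, name="data", mtime=0):
    """Compress bytes with the gzip command-line tool at the given level.

    The data is written to a temporary file called `name` with modification
    time `mtime`, because gzip copies both into the header by default.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(data)
        os.utime(path, (mtime, mtime))
        result = subprocess.run(
            ["gzip", f"-{level}", "-c", path],
            stdout=subprocess.PIPE,
            check=True,
        )
    return result.stdout


def find_matching_level(data, target_md5, compress=gzip_cli, **kwargs):
    """Return the first level 1-9 whose output has the target MD5, else None."""
    for level in range(1, 10):
        candidate = compress(data, level, **kwargs)
        if hashlib.md5(candidate).hexdigest() == target_md5:
            return level
    return None
```

For example, `find_matching_level(data, old_md5, name="foo", mtime=1501597161)` would try each level with a stored name of foo; the timestamp here is a made-up value. Several levels can produce identical output for some inputs, so the returned level is simply the lowest one that matches.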

jjlin
  • 16,120