In the past I made a backup of a partially full partition with dd if=/dev/sda1 | gzip -5 > file.gz. Some time later, when the free space on the partition was smaller, I made an image file again with the same command, and the output file was a little smaller.

In both cases I used the same version of dd and gzip, the same parameters, the same hardware, and the same partition, and dd reported the same numbers of records in/out and bytes copied (only the time and speed differed).

What could have caused that, and how can it be explained? How can I check which image file is invalid, assuming that one of them is? What is more probable: HDD corruption that caused an undetected loss of data, or a difference related to some issue with compression?

2 Answers

It's the nature of compression. How effective it is depends on the input data. Since you compressed different data each time, you end up with different compressed sizes, even though the uncompressed size is the same.
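
A quick way to see this for yourself (a rough sketch; the exact sizes will vary with your gzip version) is to compress the same amount of highly compressible and incompressible data:

    # 10 MiB of zeros compresses to a few KiB:
    dd if=/dev/zero bs=1M count=10 2>/dev/null | gzip -5 | wc -c
    # 10 MiB of random data barely compresses (it even grows slightly):
    dd if=/dev/urandom bs=1M count=10 2>/dev/null | gzip -5 | wc -c

Both pipelines read exactly 10 MiB, so dd reports identical records in/out, yet the compressed outputs differ by orders of magnitude.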

psusi

You seem to think that free space compresses better. There is no such rule.

Common filesystems only mark free space as free, they don't overwrite it with zeros or whatever. The old data is still there until overwritten with something new. (Side note: this is why it's sometimes possible to recover deleted files).
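
If you want to verify this, here is a sketch on a throwaway loop-backed filesystem (the device /dev/loop0 and mount point /mnt/test are placeholders; whether the data survives depends on the filesystem and mount options):

    # Write a recognizable string, then delete the file:
    echo 'MARKER-STRING-1234' | sudo tee /mnt/test/victim.txt
    sync
    sudo rm /mnt/test/victim.txt
    sync
    # Search the raw device; -a makes grep treat binary data as text.
    # The marker is usually still found in the now-"free" blocks:
    sudo grep -a 'MARKER-STRING-1234' /dev/loop0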

dd reads everything; it knows nothing about filesystems or what they consider free space. Then gzip compresses everything, including the old data in "free space", which may compress well or poorly. In this context there is no free space; there is only a data stream to process.

It may be that some new, highly compressible files replaced old, poorly compressible data that was marked as free space. If so, the new archive will be smaller than the old one, despite the fact that it contains more data that you consider useful, current, or existing. This may be the main cause of what you experienced.
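
You can reproduce that scenario on a small loop-backed test image (a sketch; the sizes, paths, and filesystem type are arbitrary assumptions, and block reuse is not guaranteed):

    truncate -s 64M fs.img
    mkfs.ext4 -q -F fs.img
    sudo mount -o loop fs.img /mnt/test

    # Fill it with poorly compressible data, then "free" that space:
    sudo dd if=/dev/urandom of=/mnt/test/random.bin bs=1M count=40
    sudo rm /mnt/test/random.bin
    sudo umount /mnt/test
    gzip -5 -c fs.img | wc -c   # large: the freed blocks still hold random bytes

    # Occupy roughly the same blocks with highly compressible data:
    sudo mount -o loop fs.img /mnt/test
    sudo dd if=/dev/zero of=/mnt/test/zeros.bin bs=1M count=40
    sudo umount /mnt/test
    gzip -5 -c fs.img | wc -c   # smaller, although more space is now "in use"

The second archive is typically smaller even though the filesystem is fuller, which matches what you observed.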

Please see Clone only space in use from hard disk, and my answer there. The "preparation" step overwrites empty space with zeros, so it compresses extremely well. If you did this before each backup, the sizes of the resulting archives would probably agree with your intuition.
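
The preparation amounts to something like this (a sketch; /mnt/backup is a placeholder, and the dd is expected to end with a "No space left on device" error once the filesystem is full):

    sudo mount /dev/sda1 /mnt/backup
    # Fill all free space with zeros; dd stops when the filesystem is full:
    sudo dd if=/dev/zero of=/mnt/backup/zero.fill bs=1M
    sudo rm /mnt/backup/zero.fill
    sync
    sudo umount /mnt/backup
    # Free space now reads as zeros and compresses extremely well:
    dd if=/dev/sda1 | gzip -5 > file.gz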

"Probably", because the other answer to your question is right in general: it all depends on the input data. Even after zeroing the free space, a filesystem that is 60% full may compress to a smaller archive than an equally big filesystem that is 50% full, if the files within are different.