89

I wanted to backup a path from a computer in my network to another computer in the same network over a 100 Mbit/s line. For this I did

dd if=/local/path of=/remote/path/in/local/network/backup.img

which gave me a very low network transfer speed of somewhere around 50 to 100 kB/s, which would have taken forever. So I stopped it and decided to gzip it on the fly to reduce the amount of data to transfer. So I did

dd if=/local/path | gzip > /remote/path/in/local/network/backup.img.gz

But now I get something like 1 MB/s network transfer speed, a factor of 10 to 20 faster. After noticing this, I tested it on several paths and files, and the result was always the same.

Why does piping dd through gzip also increase the transfer rate by a large factor, instead of only reducing the byte length of the stream? If anything, I'd have expected a small decrease in transfer rate due to the higher CPU load while compressing, but instead I get a double win. Not that I'm unhappy, I'm just wondering. ;)

Foo Bar
  • 1,560

4 Answers

109

dd by default uses a very small block size -- 512 bytes (!!). That is, a lot of small reads and writes. It seems that dd, used naively in your first example, was generating a great number of network packets with a very small payload, thus reducing throughput.

On the other hand, gzip is smart enough to do I/O with larger buffers. That is, a smaller number of big writes over the network.

Can you try dd again with a larger bs= parameter and see if it works better this time?
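
For example (bs=1M here is just an illustrative value, not a magic number; it's worth experimenting with a few sizes):

dd if=/local/path of=/remote/path/in/local/network/backup.img bs=1M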

7

Bit late to this but might I add...

In an interview I was once asked what would be the quickest possible method for cloning data bit-for-bit, and of course responded with the use of dd or dc3dd (DoD funded). The interviewer confirmed that piping dd to dd is more efficient, as this simply permits simultaneous reading and writing, or in programmer terms stdin/stdout, ultimately doubling write speeds and halving transfer time.

dc3dd verb=on if=/media/backup.img | dc3dd of=/dev/sdb
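
For comparison, the plain-dd version of that piped pattern would look something like this (bs=1M is an illustrative block size I've added, not part of the original command):

dd if=/media/backup.img bs=1M | dd of=/dev/sdb bs=1M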
Get-Tek
  • 71
0

I assume here that the "transfer speed" you're referring to is being reported by dd. This actually does make sense, because dd really is transferring 10x the amount of data per second! However, dd is not transferring over the network; that job is being handled by the gzip process.

Some context: gzip will consume data from its input pipe as fast as it can clear its internal buffer. The speed at which gzip's buffer empties depends on a few factors:

  • The I/O write bandwidth (which is bottlenecked by the network, and has remained constant)
  • The I/O read bandwidth (which is going to be far higher than 1MB/s reading from a local disk on a modern machine, thus is not a likely bottleneck)
  • Its compression ratio (which, judging by your 10x speedup, I will assume to be around 10:1, i.e. output roughly 10% of input, indicating that you're compressing some kind of highly repetitive text like a log file or some XML)

So in this case, the network can handle 100kB/s, and gzip is compressing the data around 10:1 (and isn't being bottlenecked by the CPU). This means that while it is outputting 100kB/s, gzip can consume 1MB/s, and that rate of consumption is what dd sees and reports.
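
If you want to watch both rates at once, one way to check this (assuming the pv utility is installed) is to drop a named pv into the pipeline on each side of gzip; -c keeps the two progress displays from clobbering each other and -N labels them:

dd if=/local/path bs=1M | pv -cN raw | gzip | pv -cN compressed > /remote/path/in/local/network/backup.img.gz

If the explanation above is right, the raw meter should read roughly 10x the rate of the compressed one.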

Tullo_x86
  • 140
0

Cong is correct. You are streaming the blocks off of disk uncompressed to a remote host. Your network interface, the network, and your remote server are the limitation.

First you need to get dd's performance up. Specifying a bs= parameter that aligns with the disk's buffer memory, say bs=32M for instance, will get the most performance from the disk. It will then fill gzip's buffer at SATA or SAS line rate straight from the drive's buffer, and the disk will be more inclined to sequential transfers, giving better throughput. gzip will compress the data in the stream and send it to your location. If you are using NFS, that will keep the NFS transmission minimal. If you are using SSH, you incur the SSH encapsulation and encryption overhead. If you use netcat, you have no encryption overhead.
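
As a rough sketch of the netcat variant (host name and port here are placeholders, and nc's listen syntax differs between the BSD and traditional variants):

On the receiving machine:

nc -l 9999 > backup.img.gz

On the sending machine:

dd if=/local/path bs=32M | gzip | nc receiving-host 9999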

Robert
  • 1