IMPORTANT NOTE: Compression is NOT the goal; archiving/tarring (packing all of the files into a single archive) is the goal.
I want to back up a single directory, which contains hundreds of sub-directories and millions of small files (< 800 KB). When using rsync to copy these files from one machine to another remote machine, I have noticed that the transfer speed is painfully low, only around 1 MB/s, whereas when I copy huge files (e.g. 500 GB) the transfer rate is around 120 MB/s. So the network connection is not the problem at all.
At that rate, moving just 200 GB of these small files took me about 40 hours. So I am thinking of packing the entire directory into a single archive, transferring that archive to the remote machine, and then unpacking it there. I am not expecting this approach to cut 40 hours down to 5, but I suspect it would definitely take less than 40 hours.
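Roughly, the workflow I have in mind looks like this (the host name `remote` and the paths are just placeholders; I have not settled on the exact commands):

```bash
# Pack locally (no compression), transfer the single archive, unpack remotely.
tar -cf /tmp/backup.tar -C /data mydir              # one archive instead of millions of files
rsync -av --progress /tmp/backup.tar remote:/tmp/   # a single large file, so the link should be saturated
ssh remote 'tar -xf /tmp/backup.tar -C /data'       # recreate /data/mydir on the remote machine
```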
I have access to a cluster with 14 CPU cores (56 threads -- Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz) and 128 GB RAM. Therefore, CPU/RAM power is not a problem.
But what is the fastest and most efficient way to create a single archive out of so many files? I currently only know about these approaches:
- the traditional tar.gz approach
- 7zip
- pigz (parallel gzip - https://zlib.net/pigz/); see the sketch after this list
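For the pigz route, my understanding (not tested yet) is that it would be combined with GNU tar roughly like this, with `-p` controlling the number of compression threads:

```bash
# GNU tar can delegate compression to an external program via --use-compress-program.
# The thread count and paths are just examples.
tar --use-compress-program="pigz -p 16" -cf backup.tar.gz -C /data mydir
```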
However, I do not know which of these is fastest, or how the parameters should be tuned to achieve maximum speed (for example, is it better to use all CPU cores with 7zip, or just one?).
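For 7zip, the parameters I would presumably be tuning are the compression level and the thread count, along these lines (the values are guesses on my part):

```bash
# -mx=0 stores files without compression, -mmt sets the number of threads.
7z a -mx=0 -mmt=16 backup.7z /data/mydir
```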
N.B. File size and compression rate do NOT matter at all. I am NOT trying to save space at all. I am only trying to create a single archive out of so many files so that the rate of transfer will be 120 MB/s instead of 1 MB/s.
RELATED: How to make 7-Zip faster