221

I know that tar was made for tape archives back in the day, but today we have archive file formats that both aggregate files and perform compression within the same logical file format.

Questions:

  • Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).

  • Are there features of the tar file format that other file formats, such as .7z and .zip do not have?

  • Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?

hippietrail
  • 4,605
MarcusJ
  • 2,145

18 Answers

182

Part 1: Performance

Here is a comparison of two separate workflows and what they do.

You have a file on disk blah.tar.gz which is, say, 1 GB of gzip-compressed data which, when uncompressed, occupies 2 GB (so a compression ratio of 50%).

The way that you would create this, if you were to do archiving and compression separately, would be:

tar cf blah.tar files ...

This would result in blah.tar which is a mere aggregation of the files ... in uncompressed form.

Then you would do

gzip blah.tar

This would read the contents of blah.tar from disk, compress them through the gzip compression algorithm, write the contents to blah.tar.gz, then unlink (delete) the file blah.tar.
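Incidentally, the two steps can be combined; GNU tar's z flag runs the data through gzip for you, producing an equivalent blah.tar.gz without the intermediate blah.tar ever touching the disk:

tar czf blah.tar.gz files ...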

Now, let's decompress!

Way 1

You have blah.tar.gz, one way or another.

You decide to run:

gunzip blah.tar.gz

This will

  • READ the 1 GB compressed data contents of blah.tar.gz.
  • PROCESS the compressed data through the gzip decompressor in memory.
  • As the memory buffer fills up with "a block" worth of data, WRITE the uncompressed data into the file blah.tar on disk and repeat until all the compressed data is read.
  • Unlink (delete) the file blah.tar.gz.

Now, you have blah.tar on disk, which is uncompressed but contains one or more files within it, with very low data structure overhead. The file size is only slightly larger than the sum of all the file data (tar adds a 512-byte header per file, plus padding to a block boundary).

You run:

tar xvf blah.tar

This will

  • READ the 2 GB of uncompressed data contents of blah.tar and the tar file format's data structures, including information about file permissions, file names, directories, etc.
  • WRITE the 2 GB of data plus the metadata to disk. This involves: translating the data structure / metadata information into creating new files and directories on disk as appropriate, or rewriting existing files and directories with new data contents.

The total data we READ from disk in this process was 1 GB (for gunzip) + 2 GB (for tar) = 3 GB.

The total data we WROTE to disk in this process was 2 GB (for gunzip) + 2 GB (for tar) + a few bytes for metadata = about 4 GB.

Way 2

You have blah.tar.gz, one way or another.

You decide to run:

tar xvzf blah.tar.gz

This will

  • READ the 1 GB compressed data contents of blah.tar.gz, a block at a time, into memory.
  • PROCESS the compressed data through the gzip decompressor in memory.
  • As the memory buffer fills up, it will pipe that data, in memory, through to the tar file format parser, which will read the information about metadata, etc. and the uncompressed file data.
  • As the memory buffer fills up in the tar file parser, it will WRITE the uncompressed data to disk, by creating files and directories and filling them up with the uncompressed contents.

The total data we READ from disk in this process was 1 GB of compressed data, period.

The total data we WROTE to disk in this process was 2 GB of uncompressed data + a few bytes for metadata = about 2 GB.

If you notice, the amount of disk I/O in Way 2 is identical to the disk I/O performed by, say, the Zip or 7-Zip programs, adjusting for any differences in compression ratio.

And if compression ratio is your concern, use the xz compressor to encapsulate tar, and you have an LZMA2'ed tar archive, which is just as efficient as the most advanced algorithm available to 7-Zip :-)
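For instance, a minimal sketch (assuming GNU tar, where the J flag selects xz):

tar cJf blah.tar.xz files ...
tar cf - files ... | xz > blah.tar.xz    # equivalent, piping through xz explicitly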

Part 2: Features

tar stores Unix permissions within its file metadata, and is very well known and tested for successfully packing up a directory with all kinds of different permissions, symbolic links, etc. There are more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it (although compression is useful and often used).
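A minimal sketch of that plain-aggregation use (directory names are just examples):

tar cf project.tar project/    # archive only, no compression; permissions, owners and symlinks come along
tar xpf project.tar            # extract; -p restores permissions (restoring ownership generally needs root)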

Part 3: Compatibility

Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.

New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz (using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.

You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems.

allquixotic
  • 34,882
101

This has been answered on Stack Overflow.

bzip2 and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of the archive built in.

The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there are two tools here that have specific tasks and are designed to fit well together. It also means you can use tar to group files and then you have a choice of compression tool (bzip2, gzip, etc.).
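A small sketch of that chaining, with mydir as a placeholder:

tar cf - mydir | gzip  > mydir.tar.gz
tar cf - mydir | bzip2 > mydir.tar.bz2
tar cf - mydir | xz    > mydir.tar.xz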

Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.

New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz (using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.

You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems is important.

Kruug
  • 5,250
65

Tar has a rich set of operations and modifiers that know all about Unix file systems. It knows about Unix permissions, about the different times associated with files, about hard links, about soft links (symbolic links, and about the possibility that symbolic links could introduce cycles in the filesystem graph), and allows you to specify several different ways of managing all this data.

  • Do you want the extracted data to preserve file access times? Tar can do that. To preserve permissions? Tar can do that.

  • Do you want to preserve symbolic links as symbolic links? Tar does that by default. Want to copy the target instead? Tar can do that.

  • Do you want to be sure hardlinked data is only stored once (that is, to do the right thing)? Tar does that.

  • Do you want to handle sparse files well? Tar can do that.

  • Do you want uncompressed data (why?)? Tar can do that. To compress with gzip? Tar can do that. With bzip2? Tar can do that. With arbitrary external compression programs? Tar can do that.

  • Do you want to write or recover to/from a raw device? Tar's format handles that fine.

  • Do you want to add files to an existing archive? Tar can do that. To diff an archive against the filesystem to see what changed? Tar can do that. To update only those parts of the archive that have changed? Tar can do that.

  • Do you want to be sure you don't archive across more than one filesystem? Tar can do that.

  • Do you want to grab only files that are newer than your last backup? Tar can do that.

  • Do you want to preserve user and group names or numbers? Tar can do either one.

  • Do you need to preserve device nodes (like the files in /dev) so that after extraction, the system will run correctly? Tar can do that.

Tar has been evolving to handle lots and lots of use cases for decades and really does know a lot about the things people want to do with Unix filesystems.
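As a rough illustration, many of those questions map directly onto GNU tar options (file and directory names below are only examples):

tar --atime-preserve -cf backup.tar dir/   # don't disturb access times of the files being read
tar xpf backup.tar                         # -p restores permissions on extraction
tar chf deref.tar dir/                     # -h stores the targets of symlinks instead of the links
tar cSf sparse.tar big-sparse.img          # -S stores sparse files efficiently
tar rf backup.tar newfile                  # append to an existing (uncompressed) archive
tar df backup.tar                          # report differences between the archive and the filesystem
tar --one-file-system -cf root.tar /       # don't cross filesystem boundaries
tar --newer 2023-01-01 -cf incr.tar dir/   # only grab files newer than a given date
tar --numeric-owner -cf ids.tar dir/       # store numeric uid/gid rather than user/group names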

Malvineous
  • 2,798
31

You confuse the two distinct processes of archiving and compression.

Reasons for using an archiver

One reason to use archiving without compression is, for instance, if a bunch of files is copied from one host to another. A command like the following

tar cf - some_directory | ssh host "(cd ~/somewhere && tar xf -)"

can speed things up considerably. If I know that the files won't compress well, or if SSH is set up with compression, it can save considerable CPU time. Sure, one could use a more modern compression tool with an archiving function and turn off the compression. The advantage of tar is that I can expect it to be available on every system.

Reasons for using an archiver with gzip compression

One reason that I use tar with gzip is: speed! If I want to transfer a few GiB of text files from one place to another, I don't care about squeezing out the last bytes, since the compression is only used for transit, not for long-term storage. In those cases I use gzip, which doesn't max out the CPU (in contrast to 7-Zip, for instance), which means that I'm I/O bound again and not CPU bound. And again: gzip can be considered available everywhere.

Reasons for using tar in favour of scp, rsync, etc.

It beats scp if you have a lot of small files to copy (for example, a mail directory with hundreds of thousands of files). rsync, awesome as it is, might not be available everywhere. Furthermore, rsync only really pays off if part of the files - or an older version - is already present on the destination. For the initial copy, tar is the fastest, with or without compression, depending on the actual data.
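As a hedged example (host and path names are placeholders), an initial bulk copy of a huge mail directory might look like this; the compression is only for transit and nothing compressed ever lands on either disk:

tar czf - Maildir | ssh backuphost "tar xzf - -C /var/backups"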

Marco
  • 4,444
25

Adding to the other good answers here, I prefer the combination tar + gzip|bzip2|xz mainly because these compressed files are like streams, and you can pipe them easily.

Say I need to uncompress a file available on the internet. With zip or rar formats I have to download it first and then uncompress it. With tar.{gz,bz2,xz} I can download and uncompress in the same step, without needing to have the compressed archive physically on disk:

curl -s http://example.com/some_compressed_file.tar.gz | tar zx

This will leave just the uncompressed files on my disk, and it speeds up the whole process, because I am not wasting time first downloading the entire file and uncompressing it only after the download finishes. Instead, I am uncompressing it while it is downloading. You cannot do this with zip or rar files.

14

There are several reasons to stick with (GNU) Tar.

It is:

  • GPL licensed
  • good in the sense of Unix philosophy
    • single purpose tool, capable of doing several tasks
  • well documented and has many trusted features
  • compatible with several compression algorithms
  • easy to use and people have developed habits with it
  • broadly available
  • I feel warm and fuzzy inside when using software started by RMS (excluding Emacs)

If your particular beef is with having to "decompress" a tarball before being able to read the contents, then you're probably right. WinRAR and 7-Zip do it automatically. However, there are simple workarounds to this problem such as documenting the content of an archive in an uncompressed form.

12

Performance

The big difference is the order in which the compression and archiving are done. tar archives first and can then optionally send the archive to a compressor, whereas zip builds up the archive and compresses each file's data independently (with Deflate's 32 KB sliding window) as it is inserted into the archive. Compressing each file separately lets you extract specific files without having to decompress everything in the archive before them, and it prevents the compressor from carrying its dictionary across file boundaries. This means compression goes faster, but does not give as good a ratio as compressing the whole thing as one stream.

You can visualize it by thinking of two files, where the first 500 bytes of the second file are the same as the last 500 bytes of the first file. With the zip method, the compressor is restarted for the second file, so it does not remember that the first file ended in the same data, and it cannot remove the duplicate data from the second file.
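A rough way to see the effect (a shell sketch; /etc/services is just a handy source of text, and the files are deliberately small and similar):

mkdir demo
for i in $(seq 200); do head -c 4096 /etc/services > demo/part$i.txt; done
tar czf demo.tar.gz demo     # one gzip stream over the whole archive
zip -qr demo.zip demo        # Deflate restarted for every file
du -h demo.tar.gz demo.zip   # the solid .tar.gz usually comes out much smaller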

Popularity

There are plenty of other formats that have a number of advantages over tar. 7-Zip doesn't store Unix file permissions, but dar does, and zip can, and all three store an index, which allows for fast browsing, extraction of a subset of files, and updating files within the archive. They can also use multi-core CPUs for compression.

The reason everyone still uses tar is the same reason everyone still uses Windows, and Flash: people don't like change. Without a strong reason to change, people just stick to what they know. dar doesn't provide enough of a benefit to justify publishing files in the format when most people already have tar installed, and very few know about dar, so simple inertia keeps us on the old standard.

psusi
  • 8,122
11

File formats like .zip require the software to read the end of the file first, to get at the catalog of filenames. Conversely, tar stores that information interleaved with the file data throughout the stream.

The advantage of the tar way is that you can decompress data whilst reading it from a non-seekable pipe, like a network socket.

The advantage of the zip way is that, for a static file on disk, you can browse the contents and metadata without decompressing the whole archive first.

Both have their uses, depending on what you're doing.
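A quick illustration of the streaming side (the URL is a placeholder):

curl -s https://example.com/archive.tar.gz | tar tzf -   # list a remote tarball straight off the pipe

Doing the same with a .zip normally means saving it to disk first, because unzip has to seek to the catalog at the end of the file.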

xorsyst
  • 529
11

There seems to be some reluctance to answer all of your questions directly, with an apparent preference to use your question as a jumping off point for pontification. So I'll give it a shot.

Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).

No. In fact since tar and gzip are usually two processes, you even get a smidge of multi-core speed benefit that an archiver like Info-ZIP's zip does not provide. In terms of compression ratio, tar+gzip will usually do noticeably better than zip with deflate since the former can benefit from correlation between files, whereas the latter compresses files separately. That compression benefit translates into a speed benefit when extracting, since a more-compressed archive decompresses in less time.
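A hedged sketch of that: the two processes below run concurrently on separate cores, and if the pigz drop-in replacement happens to be installed, the compression side can itself use every core.

tar cf - bigdir | gzip > bigdir.tar.gz   # tar and gzip overlap on two cores
tar cf - bigdir | pigz > bigdir.tar.gz   # parallel gzip, assuming pigz is available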

Are there features of the tar file format that other file formats, such as .7z and .zip do not have?

Yes, tar was designed for Unix, and has evolved over the years to be able to exactly record and restore every odd little nook and cranny of Unix file systems, even the nookier and crannier Mac OS X Unix file system. zip is able to retain much of the metadata such as permissions, times, owners, groups, and symbolic links, but still not everything. As an example, neither zip nor 7z can recognize or take advantage of sparse files, nor are they aware of or able to restore hard links.
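A small sketch of the sparse-file point, assuming GNU tar (file names are examples):

tar cSf backup.tar big-sparse.img    # -S records the holes instead of gigabytes of zeroes
tar xf backup.tar                    # the file comes back sparse rather than fully allocated

Hard links inside an archived tree are handled automatically: the data is stored once and the extra names are recorded as links.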

Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?

Lots of other good answers here to that. The best is that it just works, and you can keep updating it to better compression formats (e.g. xz) and still use the same tar format and even the same compiled tar utility. If you just want to pack up a bunch of stuff, and then unpack it all on the other end, then there is little reason to use anything but one of the oldest, most complete, and most debugged pieces of software out there.

If you want random access, partial updates, or other things that need to deal with the contents piecemeal, or you want to be able to find out what's in it without reading the whole thing, then you would want to use a different format.

9

Tar was created for making full-fidelity backups of your filesystem, not just for transferring files around. As such, the tar utility is the most complete utility for creating an archive that preserves everything important about your filesystem structure.

This includes all these features that are missing in one or more competing tools:

  • file ownership
  • file permissions
  • less-common file permissions (e.g. setuid, sticky bit)
  • symbolic links
  • hard links
  • device entries (i.e. character and block devices)
  • sparse files
  • ACL entries (not supported by all versions)
  • extended/user attributes (not supported by all versions)
  • SELinux labels (not supported by all versions)

It also has the --one-file-system option which is extraordinarily useful when making backups.
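For example, a hedged backup sketch (the destination path is a placeholder):

tar --one-file-system -czf /mnt/backup/root.tar.gz /   # stays on the root filesystem; separate mounts like /proc, /sys or /mnt are skipped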

Any time a new feature is added to filesystems, support gets added to tar first (or even exclusively). So it continues to be the most compatible way to save files.

tylerl
  • 2,185
5

We have lots of compressed files floating around today, MP3s, JPGs, Videos, tar.gz files, JAR packages, RPMs, DEBs and so on. If you need to bundle a bunch of these into a single file for transfer, then it is useful to have a 'tar' utility which only bundles the files without attempting to compress them.

Not only does it waste time and electricity to attempt to compress a compressed file, but it often results in a file which is bigger than the original.

Another use is to improve compression ratios. For instance, if you 'tar' a bundle of log files and then gzip the result, you will likely end up with a smaller file than if you compressed them first and then bundled them with 'tar'. And of course, using tar, you can choose any compression algorithm that you want, and specify options to optimize compression for your particular use case.
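A hedged illustration of that ordering point (paths are examples):

tar cf - logs/ | gzip -9 > logs.tar.gz                 # bundle first: the compressor sees all logs as one stream
for f in logs/*.log; do gzip -9c "$f" > "$f.gz"; done  # compress first...
tar cf logs-precompressed.tar logs/*.log.gz            # ...then bundle: usually ends up larger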

I find that tar is very relevant today and I prefer it to ZIP. In our office, everyone with Windows has 7-Zip installed, so for us, tar files are fully cross-platform compatible.

4

Maybe we should wonder why such "new" file formats performing both compression and aggregation (and I would add encryption) were not built on tar from the beginning, instead of as completely different tools.

As I understand it, there are historical reasons (related to OS history, patent "protection", the ability for software vendors to sell tools, etc.).

Now, as other responses have pointed out, even today tar is not clearly inferior to other solutions, and may be better in other respects, such as the ability to work on streams or Unix rights management.

If you read the Wikipedia article about tar you can see another interesting fact. The article acknowledges some shortcomings of tar... but does not suggest using zip instead (the zip format does not really solve these shortcomings) but DAR.

I will end with a personal touch. Some time ago I had to create a file format for storing encrypted data. Using tar as a basis was handy (others made the same choice; for instance, tar is the internal aggregation format for .deb packages). It was obvious to me that trying to compress data after encryption was totally useless, so I had to perform compression as an independent step before encryption, and I was not ready to use zip encryption either (I wanted two-key encryption with public and private keys). With tar it worked like a breeze.
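A minimal sketch of that compress-then-encrypt pipeline, assuming GnuPG and with the recipient and paths as placeholders:

tar cf - secret_dir | gzip | gpg --encrypt --recipient you@example.org > secret.tar.gz.gpg
gpg --decrypt secret.tar.gz.gpg | tar xzf -    # and back again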

kriss
  • 221
3

The reason is "entrenchment in the culture". There are numerous people like me whose eyes glaze over if they are asked to process anything other than a compressed tar archive, or the occasional ZIP, if it came from the Windows world.

I don't want to hear about 7-Zip, RAR or anything else. If I have to install a program to uncompress your file, that is work. I will do it if it results in me being paid, or if the content is something I "must have" and isn't available in any other way.

One advantage of tar is that if you send someone a tarball, it is instantly recognized. The recipient can type the extraction commands using muscle memory.

The real question is: why are some people so obsessed with saving one more byte of space that they ask everyone else to waste time installing some exotic utility and learning how to use it? And then there are the stupid uses of exotic compression and archive formats. Does an H.264 video with AAC sound really need to be put into a multi-part RAR?

The tar format may be old, but it stores everything that is relevant: file contents, paths, timestamps, permissions and ownership. It stores not only symbolic links, but it can also preserve hard link structure. It stores special files too, so a tape archive can be used for things like a miniature /dev directory that is used during bootstrapping. You can put together a Linux distribution whose binary package format consists of nothing but tarballs that are unpacked relative to the filesystem root.

Kaz
  • 2,800
3

I'm surprised no one has mentioned this, but one of the reasons—not really an advantage, but a necessity—is for backwards compatibility. There are plenty of systems running software for decades that might call tar for archiving. It's not cost effective to hire someone to "fix" all the old systems.

ctype.h
  • 863
Keltari
  • 75,447
3

tar is UNIX as UNIX is tar

In my opinion, the reason for still using tar today is that it is one of the (probably rare) cases where the UNIX approach got it perfectly right from the very beginning.

Taking a closer look at the stages involved in creating archives, I hope you'll agree that the way the separation of tasks takes place here is the UNIX philosophy at its very best:

  • one tool (tar, to give it a name here) specialized in transforming any selection of files, directories and symbolic links, including all relevant metadata like timestamps, owners and permissions, into one byte stream.

  • and another, arbitrarily interchangeable, tool (gzip, bzip2, xz, to name just a few options) that transforms any input stream of bytes into another (hopefully) smaller output stream.

Using such an approach delivers a whole bunch of benefits to the user as well as to the developer:

  • extensibility: tar can be coupled with any compression algorithm that already exists, or any compression algorithm yet to be developed, without having to change anything about the inner workings of tar at all.

    As soon as the brand new "hyper-zip-ultra" or whatever compression tool comes out, you're already ready to use it, embracing your new servant with the whole power of tar (see the sketch after this list).

  • stability: tar has been in heavy use since the early 80s and has been tested and run on numerous operating systems and machines.

    Not having to reinvent the wheel, implementing the handling of ownership, permissions, timestamps and the like over and over again for every new archiving tool, not only saves a lot of (otherwise unnecessarily spent) development time, but also guarantees the same reliability for every new application.

  • consistency: the user interface just stays the same all the time.

    There's no need to remember that to restore permissions using tool A you have to pass the option --i-hope-you-remember-this-one, with tool B you have to use --this-time-its-another-one, while with tool C it's --hope-you-didnt-try-with-tool-as-switch.

    Whereas with tool D you would have really messed it up if you didn't use --if-you-had-used-tool-bs-switch-your-files-would-have-been-deleted-now.
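A sketch of the extensibility point above, assuming GNU tar and that the named compressors are installed (older tar versions may not accept arguments inside -I):

tar -I zstd -cf dir.tar.zst dir/     # plug in a compressor tar's authors never heard of
tar -I 'xz -9' -cf dir.tar.xz dir/   # or pass options through to it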

mikyra
  • 466
3

Lots of good answers, but they all neglect an important fact. Tar has a well-established ecosystem of users and developers in the Unix-like world. That keeps it going, just as ZIP is kept going by its DOS/Windows ecosystem. Having such an ecosystem is what sustains a technology, not its technical advantages.

2

Directly answering the specific questions you posed:

Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).

There is a specific performance improvement, in general cases, using tar especially with the compression library built in (the tar xvzf or tar xvjf style command lines, where a compression library is used rather than a second process). This comes from two main causes:

  • when processing a large number of relatively small files, especially those commonly used in distributing software, there is high redundancy. Compressing over many files results in higher overall compression than compressing individual files. And the "dictionary" is computed once for every chunk of input, not for each file.

  • tar understands file systems. It is designed to save and restore a working/workable operating system. It deeply grasps exactly what is important on a UNIX file system, and faithfully captures and restores that. Other tools... not always, especially the zip family, which is better designed for sharing files amongst a family of OSs, where the document is the important thing, not a faithful OS sensitive copy.

Are there features of the tar file format that other file formats, such as .7z and .zip do not have?

Sparse file handling. Some database libraries rely on sparse files - files where the data is nominally gigabytes in size, but the actual data written and stored is much, much less, and only a few blocks of disk are actually used. If you use an unaware tool, then on decompressing you end up with massive disk block consumption, all containing zeroes. Turning that back into a sparse file is... painful. If you even have the room to do it. You need a tool that grasps what a sparse file is and respects that.

Metadata. Unix has evolved some strange things over the years: 14-character file names, long file names, hard links, symlinks, sticky bits, setuid bits, inherited group access permissions, etc. Tar understands and reproduces these. File-sharing tools... not so much. A lot of people don't use links the way they could. If you've ever worked with software that does use links, and then used a non-aware tool to back up and restore, you now have a lot of independent files instead of a single file with many names. Pain. Your software fails and you have disk bloat.

Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?

tar works. It does the job it is designed for, and does it well. There have been other touted replacements (cpio, pax, etc.). But tar is installed on pretty much everything, and the compression libraries it uses are also very common for other reasons. Nothing else has come along that substantially beats what tar does. With no clear advantages, and a lot of embedded use and knowledge in the community, there will be no replacement. Tar has had a lot of use over the years. If we get major changes in the way that we think about file systems, or non-text files somehow become the way to transfer code (I can't currently imagine how, but ignore that...), then you could find another tool. But then it wouldn't be the type of OS that we use now. It'd be a different thing, organised differently, and it would need its own tools.

The most important question, I think, that you didn't ask, is what jobs 'tar' is ill-suited to.

tar with compression is fragile. You need the entire archive, bit for bit. In my experience, it is not resilient. I've had single-bit errors result in multi-part archives becoming unusable. It does not introduce redundancy to protect against errors (which would defeat one of the things you asked about, data compression). If there is a possibility of data corruption, then you want error checking with redundancy so you can reconstruct the data. That means, by definition, that you are not maximally compressed. You can't have every bit of data both being required and carrying its maximum value of meaning (maximum compression) and being capable of loss and recovery (redundancy and error correction). So... what's the purpose of your archive? tar is great in high-reliability environments and when the archive can be reproduced from source again. In my experience, it's actually worse at the original thing its name suggests - tape archiving. Single-bit errors on a tape (or worse, a single-bit error in a tape head, where you lose one bit in every byte of a whole tape or archive) result in the data becoming unusable. With sufficient redundancy and error detection and correction, you can survive either of those problems.

So... how much noise and corruption is there in the environment you're looking at, and can the source be used to regenerate a failed archive? The answer, from the clues that you've provided, is that the system is not noisy, and that source is capable of regenerating an archive. In which case, tar is adequate.
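If corruption is a real risk, one hedged mitigation (assuming the par2 utility is available) is to generate recovery data alongside the tarball rather than giving up compression:

par2 create -r10 blah.tar.xz     # writes .par2 recovery volumes with roughly 10% redundancy
par2 repair blah.tar.xz.par2     # later, reconstructs the archive from what survives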

tar with compression also doesn't play well with pre-compressed files. If you're sending around already-compressed data... just use tar, and don't bother with the compression stage - it just adds CPU cycles without achieving much. That means that you do need to know what you're sending around and why. If you care. If you don't care about those special cases, then tar will faithfully copy the data around, and the compressor will faithfully fail to do much to make it smaller. No big problem, other than some CPU cycles.

JezC
  • 580
-3

TAR is Tape Archive. It's been around for decades and it's widely used and supported. It is a mature product and takes care of the current needs as well as legacy ones.

Edward
  • 529