28

In general, I've observed the following:

  • Linux-y files or tools use bzip2 or gzip for distributing archives
  • Windows-y files or tools use ZIP for distributing archives
  • Many people use 7-Zip for creating and distributing their own archives

Questions:

  • What are the advantages and disadvantages of these formats, all of which appear to be open formats? When/why should I choose one (say, 7-Zip) over another (say, ZIP)?
  • Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?
user541686

6 Answers

17

There is a large variety of compression formats and methods available. Some don't compress at all and are designed only to store a number of files in one archive. Other, newer experimental compressors (PAQ-based) are designed to compress as aggressively as possible, regardless of how long the operation takes.

You need to evaluate the features you require from your compression method choice, and also consider the context in which it will be used.

Different features and considerations include:

  • Compression ability - Does it shrink the file significantly enough?
  • Ease-of-use - If the file is going to another user, will the archive be easy to extract or will it require more software to be installed?
  • Password protection and/or encryption - Are these security measures required?
  • Multiple volumes support - If the target medium requires the file to be split into appropriate chunks (for example, 650 MB for a CD), does the format support this elegantly?
  • Repairing and recovery - If the file becomes partially corrupt, does it offer a recovery record to aid restoration of data?
  • Unicode support - Does the archiver support international file names or just standard ASCII?
  • System Requirements - Modern compressors such as 7-Zip do offer the ability to increase compression efficiency by using a larger dictionary (a dictionary is a reference of commonly repeated data in a compressed file), but this in turn increases memory consumption at both compression and decompression time.
  • Self-extraction support - Can the archive be rolled into an executable file that provides ease of use to whoever needs to use it? (Also bear in mind that a self-extractor targets a single platform. Generally speaking, a Windows self-extractor will not work on Linux unless it is run through a compatibility layer like Wine.)
  • File system attributes - Does the compressor store relevant file system metadata and permissions that may be worth preserving at point of extraction?
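To illustrate the dictionary-size point in the list above, here is a minimal sketch using Python's standard `lzma` module, which implements the same LZMA family that 7-Zip uses. The helper name `compress_with_dict` and the sample sizes are assumptions chosen for illustration. The sample data repeats at a distance of 256 KiB, so only a dictionary larger than that distance can exploit the repetition.

```python
import lzma
import os


def compress_with_dict(data: bytes, dict_size: int) -> bytes:
    """Compress data as .xz with an explicit LZMA2 dictionary size (hypothetical helper)."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


# 256 KiB of incompressible bytes, repeated four times (1 MiB in total).
block = os.urandom(256 * 1024)
data = block * 4

# A 64 KiB dictionary cannot "see" the previous copy of the block;
# a 1 MiB dictionary can, at the cost of more memory.
small_dict = compress_with_dict(data, 64 * 1024)
large_dict = compress_with_dict(data, 1024 * 1024)

print("original:       ", len(data))
print("64 KiB dict:    ", len(small_dict))
print("1 MiB dict:     ", len(large_dict))
```

Real files rarely behave this starkly, so treat this as a demonstration of the mechanism rather than a benchmark.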

Generally speaking, ZIP is the most ubiquitous format, but it has limitations. Sizes over 4 GB need the ZIP64 extension, which not every tool supports. Security is generally regarded as poor: the standard password protection can be compromised with a known-plaintext attack, and stronger encryption is generally implemented as an unofficial derivative of the format by commercial ZIP software vendors.

Apart from that, most other popular formats will have some form of support on all operating systems by installing more software.

My personal choice is 7-Zip, as it has great and flexible compression, despite its peculiar user interface on Windows. There are decompressors for Linux and Mac OS X (although not GUI-based as standard).

8

One thing that comes to mind is a (two-year-old) blog post from Jeff Atwood: File Compression in the Multi-Core Era. In that article he finds that bzip2 outperforms 7-Zip when running on more than two cores.

matpe
6

As others have mentioned, the choice of a particular compression format is heavily dependent on the use and the intended audience.

  • .tar.gz and .tar.bz2 archives are ideal for use on Linux systems (and by extension for sharing files with Linux users) because the tar, gzip and bzip2 tools are largely ubiquitous on the platform, and because the .tar format has full support for Unix permissions and other platform-specific properties. The choice between gzip and bzip2 to compress the tar archive is mainly a decision about speed versus compression ratio, with bzip2 delivering smaller files but at a much slower compression speed. The disadvantages of these formats include less compatibility with Windows and the (potential) need to uncompress the entire archive to extract a single file.

  • ZIP archives can be extracted on most platforms using native tools, so ZIP is an ideal choice for sending an archive to a non-technical user who would be uncomfortable with installing third-party archive software such as 7-Zip. The compression ratio isn't as good as that of more advanced algorithms, and Unix permissions are not reliably preserved across tools, but it is an excellent format if you want to send an archive of holiday photos to your grandmother, for example. ZIP also provides some basic password protection, and can quickly extract a file from anywhere in the archive.

  • 7-Zip is good if you want the best possible compression ratios. Like ZIP, it doesn't support Unix file permissions or ownership, and is also not installed by default on most platforms which makes it slightly more work to use, but it may be worth it on Windows if the compression ratio gains are important. In an all-Linux environment it would be better to use the 'xz' or 'lzma' compression tools along with tar, which operate in exactly the same way as 'gzip' and 'bzip2' but use the more advanced LZMA algorithm like 7-Zip.
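The speed-versus-ratio trade-off described above is easy to measure on your own data. The following is a rough sketch using Python's standard `gzip`, `bz2` and `lzma` modules with their default settings. The log-like sample text is made up for illustration, and results will differ on other data and machines.

```python
import bz2
import gzip
import lzma
import time

# Repetitive, log-like sample text (invented for this example).
data = b"".join(
    b"2024-01-01 12:00:%02d INFO request %d served in %d ms\n" % (i % 60, i, i % 97)
    for i in range(50_000)
)

compressors = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz (LZMA)": lzma.compress,
}

# Record (compressed size, seconds taken) for each compressor.
results = {}
for name, compress in compressors.items():
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    results[name] = (len(packed), elapsed)
    print(f"{name:10s} {len(packed):>9d} bytes  {elapsed:.3f} s")

print(f"{'original':10s} {len(data):>9d} bytes")
```

Run it against a representative sample of what you actually plan to archive before settling on a format.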

4

To your first question, 7-Zip is an archiver that can use many algorithms to compress and decompress data.

To your second question, just make sure that the platform supports tools that support the given format. For example, I would avoid using RAR on a Mac. While it is possible to use, and there are free utilities that support it, they lack the much richer interface that Windows utilities that support RAR have (in my experience).

soandos
2

Just as an example, I use the mentioned formats in these cases:

  • Text files (logs especially): bz2
  • Collection of files to be distributed (e.g. source code): gz (tar.gz really).
  • Assorted files: 7zip. I can compress almost anything in a very efficient way. Cross-platform, open-source, stable, lightweight, file (header and data) encryption,... Can you ask for anything else? :)

I avoid RAR altogether, and whenever I receive a RAR file from someone I know, I tell them to stop using that format since it is proprietary, and that they are probably using unlicensed software (most people download WinRAR's trial version and keep using it forever).

PS: I run Ubuntu (primarily) and Windows (both dual boot and VirtualBox).

glarrain
1

There are at least four separate jobs that are often conflated because popular tools integrate them:

  1. Archiving: the ability to combine multiple files (including metadata) into a single file, preserving as much as possible. In the Linux/Unix world, archiving is traditionally done in the TAR file format.
  2. Compression: the ability to losslessly minimize the size of a stream of binary data. In the Linux/Unix world, this is traditionally done by GZip and BZip2.
  3. Encryption: the ability to scramble data with keys.
  4. Checksum: the ability to detect (and possibly correct) errors.
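As a small illustration of the checksum job on its own, here is a sketch using CRC-32 from Python's standard `zlib` module. The helper `is_intact` and the sample payload are hypothetical. Note that a plain CRC only detects damage; correcting it needs extra redundancy such as a recovery record.

```python
import zlib

# Pretend this is the content of an archive member (made-up sample data).
payload = b"archive contents " * 1000

# The checksum an archiver would store alongside the data.
stored_crc = zlib.crc32(payload)

# Simulate corruption by flipping a single bit somewhere in the data.
corrupted = bytearray(payload)
corrupted[500] ^= 0x01
damaged = bytes(corrupted)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Return True if the data still matches the stored checksum (hypothetical helper)."""
    return zlib.crc32(data) == expected_crc


print("original intact:", is_intact(payload, stored_crc))
print("damaged intact: ", is_intact(damaged, stored_crc))
```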

The ubiquity of .tar.gz and .tar.bz2 corresponds to the Unix philosophy of small tools that each do a single job well, rather than a single tool that does everything. The TAR file format does not support compression or encryption, but a TAR file can be compressed by any compressor (including as .tar.zip or .tar.7z). The job of GZip and BZip2 is simply to compress one byte stream into another; the compression layer does not need to care about preserving metadata, encryption or checksums. Over time, though, several shortcuts have been added to the tar program to make working with a compressor more convenient.
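The two-layer design can be sketched with Python's standard `tarfile` and `gzip` modules: the archive layer records names and permission bits, and the compression layer only ever sees a stream of bytes. The file names, contents and the `make_tar` helper below are invented for illustration.

```python
import gzip
import io
import tarfile


def make_tar(files: dict) -> bytes:
    """Build an uncompressed TAR archive in memory (hypothetical helper).

    `files` maps a name to a (content, permission bits) pair.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, (content, mode) in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            info.mode = mode  # Unix permission bits live in the archive layer
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


files = {
    "build.sh": (b"#!/bin/sh\nmake\n", 0o755),
    "README": (b"hello\n", 0o644),
}

# Step 1: archive. Step 2: compress the resulting byte stream.
tar_bytes = make_tar(files)
tgz_bytes = gzip.compress(tar_bytes)

# Reverse the steps: decompress first, then read the archive.
restored = gzip.decompress(tgz_bytes)
with tarfile.open(fileobj=io.BytesIO(restored), mode="r") as tar:
    modes = {member.name: member.mode for member in tar.getmembers()}

print({name: oct(mode) for name, mode in modes.items()})
```

Swapping `gzip` for `bz2` or `lzma` in the two compression calls changes the compressor without touching the archiving code, which is the mix-and-match property in practice.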

In the ZIP and 7z file formats, these separate jobs are done by a single program in a single super file format.

Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?

Because that is the way it has always been done. Program source code is traditionally distributed as .tar.gz or .tar.bz2 because preserving file permissions, modification times, etc. is important for various tools used in programming (e.g. make).

The separate archival and compression steps have worked very well for years. The approach has the clear advantage of letting you freely mix and match archival and compression, and its disadvantage (a two-step process) is easily circumvented by smarter tools (most modern Linux compression programs will compress directly to .tar.gz or .tar.bz2, hiding the intermediate step).
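Python's standard `tarfile` module is one example of such a smarter tool: the mode string selects the compressor, and the intermediate .tar never has to exist as a separate file. The sketch below uses invented file names and two hypothetical helpers, `pack` and `list_names`.

```python
import io
import tarfile


def pack(files: dict, mode: str) -> bytes:
    """Write name -> content pairs into a compressed tar in one step (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def list_names(blob: bytes) -> list:
    """List member names, letting tarfile detect the compression ("r:*")."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return tar.getnames()


files = {
    "src/main.c": b"int main(void) { return 0; }\n",
    "Makefile": b"all:\n\tcc src/main.c\n",
}

# Same archive layer, three different compression layers.
archives = {mode: pack(files, mode) for mode in ("w:gz", "w:bz2", "w:xz")}

for mode, blob in archives.items():
    print(mode, len(blob), list_names(blob))
```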

There is no strong reason to move to other file formats: newer compressors do not offer a significantly better compression ratio to justify breaking with tradition, and tar preserves everything well enough.

Lie Ryan