Which is more space-efficient for Git repositories: LibreOffice/OpenOffice .odt files or .fodt files? My guess is .fodt: because .fodt is uncompressed XML, Git's compression can exploit the redundancy between files, whereas .odt files are already compressed and can't be compressed much further. But that's just a guess. Does anyone have any practical experience?
3 Answers
I performed the following test:
I put 5 revisions of a small .odt file into a repository, making small changes to the document in each revision.
I committed the equivalent data as .fodt. Each .fodt revision was obtained by taking the corresponding .odt revision and saving it as .fodt.
My results follow:
        before gc   after gc
odt     260k        260k
fodt    118k        38k
Note that I measured the size of the .git directory only, which is where the revisions are actually stored.
I did not count the ODT/FODT file in the working directory, because including it does not give meaningful results.
ODT is essentially zipped FODT, so the FODT file itself is expected to be much larger than the ODT file.
The point is to estimate how the Git history grows, so the working copy of the ODT/FODT file should be excluded when measuring: it is stored only once, regardless of how long the history is. In the long run the history takes up most of the space, so for a simple test to give relevant measurements, the document in the working directory should NOT be counted in the size of the repository.
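A test like this could be scripted roughly as follows. This is only a sketch, not the exact procedure used for the numbers above: the helper names `dir_size`, `git` and `measure_history` are made up for illustration, `git` is assumed to be on the PATH, and sizes are summed from file lengths rather than disk blocks, so the figures will not match `du` exactly.

```python
import os
import shutil
import subprocess
import tempfile


def dir_size(path):
    """Sum the sizes in bytes of all regular files below path (file lengths, not disk blocks)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def git(repo, *args):
    """Run one git command inside repo and raise if it fails."""
    subprocess.run(["git", "-C", repo, *args], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def measure_history(revisions, filename):
    """Commit each revision (bytes) of a single file, then return the size of
    .git before and after `git gc`. Consecutive revisions must differ,
    otherwise git has nothing to commit."""
    repo = tempfile.mkdtemp()
    try:
        git(repo, "init", "--quiet")
        for number, data in enumerate(revisions, start=1):
            with open(os.path.join(repo, filename), "wb") as handle:
                handle.write(data)
            git(repo, "add", filename)
            # Identity and signing are set inline so the sketch does not
            # depend on the user's global git configuration.
            git(repo, "-c", "user.name=tester",
                "-c", "user.email=tester@example.com",
                "-c", "commit.gpgsign=false",
                "commit", "--quiet", "-m", f"revision {number}")
        git_dir = os.path.join(repo, ".git")
        # The working-tree copy lives outside .git, so it is not counted.
        before = dir_size(git_dir)
        git(repo, "gc", "--quiet")
        after = dir_size(git_dir)
        return before, after
    finally:
        shutil.rmtree(repo, ignore_errors=True)
```

Calling `measure_history` once with the .odt revisions and once with the .fodt revisions would give the two rows of a table like the one above.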
- 490
Doing some quick-and-dirty testing, I put ten revisions of a tiny .odt file into a bare git repository, then put the same ten revisions of the document in .fodt format into a different bare repository. The resulting sizes of the repository:
        before gc   after gc
odt     408k        188k
fodt    399k        148k
So .fodt offers a very slight saving in repository size, even though the .fodt file itself is 2.7 times bigger than the equivalent .odt file.
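To put a number on the saving, the figures in the table can be compared directly. A small sketch (the function name `saving_percent` is just for illustration; the inputs are the sizes reported above, in k):

```python
def saving_percent(odt_size, fodt_size):
    """Percentage by which the .fodt repository is smaller than the .odt one."""
    return 100.0 * (odt_size - fodt_size) / odt_size


# Sizes in k, taken from the table above.
before_gc = saving_percent(408, 399)
after_gc = saving_percent(188, 148)

print(f"before gc: {before_gc:.1f}% smaller")  # before gc: 2.2% smaller
print(f"after gc:  {after_gc:.1f}% smaller")   # after gc:  21.3% smaller
```

So the gap is small before packing and closer to a fifth after gc.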
- 751
The .fodt format should generally be preferred over .odt because it is text-based.
Git can store similar versions of a file as deltas against one another when it packs the repository, so a commit that changes little adds only a small amount of data. When the files differ more drastically, it will fall back to storing the whole file, as the delta may end up being larger than the new file.
For many binary file formats, especially compressed ones, a small change may drastically change the content of the file, so each change results in a whole new file being stored rather than just the differences (Git doesn't magically understand a file format[1]). Text-based formats should be preferred for files that change often when you are concerned with optimizing the repository. Note that if a file rarely changes, it may go either way: Git compresses text files for storage, but the compression built into the .odt format could outperform Git's, given that it has more knowledge of the file's use (though Git's compression can in some cases outperform it; .odt is not a highly specialized format -- it's zipped XML).
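The effect can be illustrated with a toy experiment. The sketch below is not how Git computes deltas: as a rough stand-in, it asks zlib how many extra bytes a second revision costs once the compressor has already seen the first one. The data is made-up random words in a fake XML paragraph, and all names in it are chosen just for this illustration.

```python
import random
import zlib


def make_revision(words):
    """Wrap words in a toy XML paragraph, standing in for uncompressed document XML."""
    return ("<text:p>" + " ".join(words) + "</text:p>").encode("utf-8")


def extra_cost(first, second):
    """Extra compressed bytes needed for second once the compressor has seen first."""
    return len(zlib.compress(first + second, 9)) - len(zlib.compress(first, 9))


rng = random.Random(0)  # fixed seed so runs are repeatable
letters = "abcdefghijklmnopqrstuvwxyz"
vocabulary = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
              for _ in range(500)]

rev1_words = [rng.choice(vocabulary) for _ in range(2000)]
rev2_words = list(rev1_words)
for index in range(100, 2000, 200):  # a handful of small edits
    rev2_words[index] = "EDITED"

rev1_plain = make_revision(rev1_words)
rev2_plain = make_revision(rev2_words)

# Compress each revision on its own, as a compressed document format would.
rev1_packed = zlib.compress(rev1_plain, 9)
rev2_packed = zlib.compress(rev2_plain, 9)

plain_extra = extra_cost(rev1_plain, rev2_plain)
packed_extra = extra_cost(rev1_packed, rev2_packed)

print("revision 2 compressed on its own:", len(rev2_packed), "bytes")
print("extra cost, uncompressed input:  ", plain_extra, "bytes")
print("extra cost, pre-compressed input:", packed_extra, "bytes")
```

With uncompressed input, the second revision should cost far less than storing it on its own, because most of it repeats the first. With pre-compressed input, the few edits scramble the compressed bytes, so there is little left to share.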
[1] Note that if you can convert losslessly between the file formats (.odt and .fodt may be compatible) and you prefer the .odt format, you can script Git to convert files for storage in source control and convert them back on checkout into the working directory (see the answers at https://stackoverflow.com/questions/8001663/can-git-treat-zip-files-as-directories-and-files-inside-the-zip-as-blobs for an example).
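A conversion filter along those lines could be sketched as below. Note that this does not convert .odt to .fodt (that would need LibreOffice itself). As a simpler stand-in, it rewrites the .odt ZIP container with every member stored uncompressed, so that Git sees the raw XML. The function names and the filter name mentioned afterwards are hypothetical, and the sketch has not been tried against real documents.

```python
import io
import sys
import zipfile


def restore_uncompressed(odt_bytes):
    """Return the same ZIP members, in the same order, stored without compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            # Copy name, timestamp and attributes so the output depends
            # only on the input, not on when the filter runs.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info))
    return out.getvalue()


def run_as_clean_filter():
    """Entry point for use as a Git clean filter: Git pipes the working-tree
    file in on stdin and stores whatever is written to stdout."""
    sys.stdout.buffer.write(restore_uncompressed(sys.stdin.buffer.read()))
```

To wire it up, a script that calls `run_as_clean_filter()` would be registered as a clean filter, for example with `*.odt filter=odt-stored` in .gitattributes and `git config filter.odt-stored.clean "python3 /path/to/script.py"` (the filter name and path are made up here). Without a matching smudge filter, checkouts get the uncompressed-stored package, which is larger on disk but should still be a valid ZIP.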