
I work with some big image datasets containing millions of images, and I often need to compress the results of each step of processing to be uploaded as backup.

I have seen that some datasets can be downloaded as a set of .zip files, which can be unzipped independently into the same folder as one consistent dataset. This can be pretty convenient as it enables me to pipeline the download -> decompress -> delete archive process, which is more efficient in terms of both time and storage space, as explained below with arbitrary time/sizes:

  • When decompressing a single 100GB .zip, let's say downloading takes 5 minutes and decompressing takes 10 minutes. I need 15 minutes to get all my data. Assuming the .zip had a 50% compression ratio, I need to use 100+200 = 300GB disk space.
  • When decompressing two 50GB .zip files, let's say downloading each takes 2.5 minutes and decompressing each takes 5 minutes. I can spend 2.5 minutes downloading zip1, then 5 minutes decompressing zip1 while downloading zip2 (2.5 minutes) in parallel, delete zip1, then decompress zip2 in 5 minutes, for a total of 2.5+5+5 = 12.5 minutes. Meanwhile, the most I ever have on disk at the same time is zip2, folder1 and folder2, so 50+100+100 = 250GB of disk space.
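
The arithmetic in the two cases above can be reproduced with a small estimator. This is only a sketch under the same arbitrary numbers, and it assumes things the cases imply but do not state for more than two parts: parts are equal in size, times scale linearly with size, the next download starts when extraction of the current part starts, and each archive is deleted right after it is extracted. The function name and parameters are made up for illustration.

```python
def pipeline_estimate(parts, archive_gb=100.0, download_min=5.0,
                      decompress_min=10.0, expansion=2.0):
    """Return (total minutes, peak disk in GB) for a dataset split into `parts` zips."""
    d = download_min / parts      # minutes to download one part
    x = decompress_min / parts    # minutes to decompress one part
    # First download, then (parts - 1) overlapped steps, then the last extraction.
    total_min = d + (parts - 1) * max(d, x) + x
    # Worst case: the last archive plus the fully extracted dataset.
    peak_gb = archive_gb / parts + archive_gb * expansion
    return total_min, peak_gb


for n in (1, 2, 4, 10):
    minutes, peak = pipeline_estimate(n)
    print(f"{n:>2} parts: {minutes:g} min, peak {peak:g} GB")
```

With these assumptions, one part gives 15 minutes and 300GB, and two parts give 12.5 minutes and 250GB, matching the figures above.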

These time and space savings increase as we increase the number of separate zip files. I am therefore looking for a way to do this.
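
For context, the download -> decompress -> delete pipeline on the consumer side could look roughly like the sketch below. It is an assumption-laden illustration, not a tested tool: `consume_parts` and its `fetch` argument are hypothetical names, and `fetch` stands for any callable that takes a source (for example a URL) and returns the local path of the downloaded archive.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def consume_parts(sources, fetch, dest):
    """Extract each part into `dest` while the next part is being fetched."""
    os.makedirs(dest, exist_ok=True)
    if not sources:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, sources[0])
        for upcoming in list(sources[1:]) + [None]:
            archive = pending.result()                  # wait for the current part
            if upcoming is not None:
                pending = pool.submit(fetch, upcoming)  # fetch the next part in the background
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(dest)                     # all parts share one destination folder
            os.remove(archive)                          # free the space before moving on
```

A `fetch` built on `urllib.request.urlretrieve(url, filename)` would be one option. At most two archives exist on disk at any time: the one being extracted and the one being downloaded.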

My requirements are as follows:

  • The method can work on any folder structure, no matter how deep
  • Compression results in .zip files of roughly equal size
  • All resulting archives can be decompressed independently to reconstruct part of the folder (sometimes I may want to use only part of the dataset for tests, in which case I don't want to have to decompress the entire dataset)
  • Optional:
    • The method should be able to show a progress bar
    • The method is fast and efficient

I think I would be able to write a Bash or Python script that meets the first few requirements, but I doubt it would be fast enough.

I am aware of the -s switch in zip and the -v switch in 7z, but they both require the users to have all the parts of the archive to be able to decompress any part of it, which is much less desirable.
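
To make the requirements concrete, here is a minimal sketch of the kind of script I have in mind, using only the Python standard library. The function names are made up. It balances parts on the size of the input files, which assumes the files compress by a similar ratio (plausible for images that are already compressed, but still an assumption), and it does not preserve empty directories or show progress.

```python
import heapq
import os
import zipfile


def plan_parts(root, parts):
    """Assign every file under `root` to one of `parts` bins of similar total size."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            files.append((os.path.getsize(path), path))
    # Greedy packing: put the largest remaining file into the currently lightest bin.
    heap = [(0, index) for index in range(parts)]
    bins = [[] for _ in range(parts)]
    for size, path in sorted(files, reverse=True):
        total, index = heapq.heappop(heap)
        bins[index].append(path)
        heapq.heappush(heap, (total + size, index))
    return bins


def write_parts(root, bins, out_dir, prefix="part"):
    """Write one standalone .zip per bin; paths inside are relative to `root`."""
    os.makedirs(out_dir, exist_ok=True)
    archives = []
    for index, paths in enumerate(bins):
        archive = os.path.join(out_dir, f"{prefix}_{index:03d}.zip")
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                zf.write(path, arcname=os.path.relpath(path, root))
        archives.append(archive)
    return archives
```

Usage would be something like `write_parts(root, plan_parts(root, 20), out_dir)`. Because each archive stores paths relative to the dataset root, any subset of the parts can be unzipped into the same folder.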

2 Answers

2

I have a script that can assist with this task. Below is an example of a Bash script that compresses each file into its own ZIP archive, so that every archive can be extracted separately. The script changes into the target directory you set and generates the ZIP archives there. I've tested this process, and Python, particularly with Pandas, can easily read these archives without manual extraction.

#!/bin/bash

# Set the target directory
target_directory="/path/to/your/directory"

# Navigate to the target directory
cd "$target_directory" || exit

# Iterate through files in the directory
for file in *.csv; do
  if [ -f "$file" ]; then
    # Build the target ZIP file name
    zip_file="${file}.zip"

    # Check if the target ZIP file already exists, if yes, skip compression
    if [ -f "$zip_file" ]; then
      echo "File $zip_file already exists. Skipping compression."
    else
      # Compress the file and branch on the exit status of zip
      if zip "$zip_file" "$file"; then
        echo "File $file compressed successfully into $zip_file."
        # Remove the original CSV file after successful compression
        rm "$file"
      else
        echo "File $file compression failed."
      fi
    fi
  fi
done

Running this script in the directory will create separate ZIP files for each CSV file and will delete the original CSV file upon successful compression.
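
As an illustration of reading such an archive without extracting it to disk, here is a small sketch using only the Python standard library. The function name is made up, and it assumes each archive holds exactly one UTF-8 encoded CSV, as the script above produces.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_path):
    """Return the rows of the first CSV stored in `zip_path`, read straight from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        member = zf.namelist()[0]  # assumption: one CSV per archive
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text))
```

With Pandas, `pandas.read_csv` can be pointed directly at a `.zip` path containing a single CSV, which is the route I used.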

0

The ZIP file format is really just a container (basically a folder) which contains compressed files. This is in contrast with the .tar.gz format which is frequently used on Linux platforms. The advantage of ZIP is that the contents can be individually extracted exactly as you are hoping to do without extracting the entire archive.

Indeed, most operating systems, including Windows, natively support opening a ZIP folder to review file names and metadata without extracting the entire archive. It isn't difficult to extract just a subset of a large directory structure either (in Windows you merely copy-paste a selection of files). 7-Zip is able to do this as well, but you have to press the "Copy" button and then specify the destination.
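
The same kind of partial extraction can be scripted. The sketch below uses Python's standard `zipfile` module; the function name and the idea of selecting members by path prefix are just for illustration.

```python
import zipfile


def extract_subset(zip_path, prefix, dest):
    """Extract only the members whose archive path starts with `prefix`."""
    with zipfile.ZipFile(zip_path) as zf:
        # namelist() reads the archive's directory of entries, not the file data.
        members = [name for name in zf.namelist() if name.startswith(prefix)]
        zf.extractall(dest, members=members)
    return members
```

For example, `extract_subset("dataset.zip", "train/", "out")` would unpack only the entries under `train/`, with the archive and folder names here being placeholders.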

There are issues with nested .zip files: generally the parent .zip will have to be fully extracted in order to review the children.

As an aside, the .tar.gz format I mentioned uses the same DEFLATE algorithm as ZIP, but it can sometimes compress better since the file names and metadata are also compressed. The cost of doing this is that usually the whole archive must be decompressed to review any of its contents.