I'm building on the answer here to write a backup script. The script I have is roughly

backup_files="/etc /home"
excludes="--exclude-vcs --exclude-ignore-recursive=.tarignore"

#(Skip irrelevant details)

total_size=$(du -csb $backup_files | awk '{print $1}' | tail -n 1)

tar cf - $excludes $backup_files -P | pv -s $total_size | gzip > "$target_file"

Only, the computation for total_size overestimates the total, so pv's time estimate comes out too long. I've been fiddling with the script to tighten the estimate, but I've run into some problems. For instance, I have tried

all_files=$(tar cvf /dev/null $excludes $backup_files -P |grep -v -e /$)
total_size=$(du -csb $all_files)

Which runs into the issue of too many arguments (approximately a million files). Iterating over this with a for loop runs into issues with filenames. Among other things, spaces break the loop and some odd Unicode filenames break stuff. Also, I tried timing the loop and it would take hours.
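For reference, the "too many arguments" part can be sidestepped by letting xargs batch the list under the ARG_MAX limit and summing the per-file byte counts afterwards; a sketch (assumes GNU xargs and GNU du, and it still breaks on newlines in filenames, which is the remaining problem below):

```shell
# Sketch: batch the newline-separated file list through xargs to stay
# under ARG_MAX, then sum the per-file apparent sizes with awk.
# Caveat: a newline inside a filename still splits it in two.
sum_sizes () {
  tr '\n' '\0' | xargs -0 du -b 2>/dev/null | awk '{s += $1} END {print s}'
}
```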

With a few pointers from comments and a now deleted answer, I've gotten as far as

run_tar () {
  printf '%s\n' $excludes $backup_files | tar -cSPf - --files-from -
}

list_files () {
  printf '%s\n' $excludes $backup_files | tar -cvPf /dev/null --files-from - | grep -v -e '/$'
}

compute_size () {
  list_files | while read -r f; do printf '%s\0' "$f"; done | du -csb --files0-from=- | awk '{print $1}' | tail -n 1
}

This fixes the overhead from the for loop and the problems with spaces. Currently, it takes about a minute or two to process a million or so files.

Where I'm still stuck is the Unicode errors. The filenames are rendered as e.g. Yle P\344\344uutiset.xml. Forwarding errors to /dev/null hides the problem, and this is only a handful of files anyway. An ls of one of the misbehaving directories shows that there's a file called 'Yle P'$'\344\344''uutiset.xml'. I think this instance is a case of filename breakage, but the issue remains that these are still valid filenames. For that matter, newline is also a valid character in a filename, so it isn't a safe separator either.
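To illustrate the newline problem, a fully NUL-separated pipeline (the hypothetical estimate_size below; assumes GNU find and GNU du) handles spaces, newlines, and odd bytes in filenames alike, though it bypasses tar's exclude options entirely, so it only approximates the real total:

```shell
# Sketch: NUL-separated file list, so no filename byte can act as a
# separator. Caveat: find does not apply tar's --exclude-vcs logic,
# so this over-counts whenever excludes would have removed files.
estimate_size () {
  find $backup_files -type f -print0 | du -csb --files0-from=- | awk 'END {print $1}'
}
```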

How do I include the few files that I'm missing from the total?

Haem

1 Answer


You ask how to compute beforehand exactly how many bytes tar will process, so you can tell pv the total amount of data that will pass through it, in order for it to calculate accurate progress statistics.

This can be done by instructing tar to write to /dev/null, so no data is actually read or written, and using the --totals option which prints the total bytes afterwards, e.g.:

tar --create --file /dev/null --totals --exclude=PATTERN FILE...

Which would output something like:

Total bytes written: 513318768640 (479GiB, 464GiB/s)

When we run

tar --create --exclude=PATTERN FILE... | wc -c

(which does read all data) we can see the number of bytes passed through the pipe is indeed exactly as reported before.

Now, in order to store only the number itself in a variable, we can pipe the output to awk. It turns out tar writes the totals to standard error, not standard output, so we need to use |& (or 2>&1 |) instead of |:

total_size=$(tar -cf /dev/null --totals --exclude=PATTERN FILE... |& awk '{print $4}')
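If tar also emits warnings on stderr (unreadable files, "Removing leading `/`", and so on), those lines end up in the same stream. A slightly more defensive variant (the hypothetical tar_total_size helper below) pins the locale and matches only the totals line; a sketch:

```shell
# Sketch: LC_ALL=C fixes the English "Total bytes written:" message,
# and the awk pattern ignores any other warnings tar prints to stderr.
tar_total_size () {
  LC_ALL=C tar -cf /dev/null --totals "$@" 2>&1 |
    awk '/^Total bytes written:/ {print $4}'
}
```

It would be used the same way, e.g. total_size=$(tar_total_size --exclude=PATTERN FILE...).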

And your actual archiving would then be done with:

tar --create --exclude=PATTERN FILE... | pv -s "$total_size" | gzip > "$target_file"

Which will show you an accurate progress meter.