
Given: A generic binary file and a block size

Desired output: A copy of the binary file where all the blocks that contain only Zero-Bits/Bytes have been removed/stripped from the file

I really wonder why I cannot find a tool that does this simple job. I wrote a small script, but its performance is ridiculously poor. There must be existing software that can do this, mustn't there?

Maybe the issue with finding this is caused by the fact that there are so many terms that can be used to express this need...

Edit: The sed thread you mention replaces every zero byte. I only want to remove zero bytes if there are at least blocksize of them in a row.

I want to investigate a very large, very sparse file (not sparse in the file-system sense), and for this analysis I want to cut out the irrelevant parts.

EDIT 2: The file size is on the order of 10 to 1000 GB. For small files my own slow tool is fine, but not for files this large.

1 Answer


bbe is "a sed-like editor for binary files". In Debian it's in the bbe package.

It would be best if you could do s/^\0*$// to identify blocks full of null bytes and remove them. My tests indicate such regex-like expressions don't work in bbe. You can still use (almost) as many \0 as you need:

s/\0\0…\0\0//

where … denotes the right number of \0 substrings. If you choose a large block size, it may be problematic to pass an accordingly long string via the command line. Fortunately bbe supports reading a script from a file. Proceed like this:

# The following function uses non-POSIX 'for' loop. Rewrite if necessary.
gen_script() {
   printf 's/'
   for ((i=0;i<"$1";i++)); do
      printf '\\0'
   done
   printf '//\n'
}

# This needs to be a plain decimal number:
blocksize=512

gen_script "$blocksize" > bbe-script
<binary_file_in bbe -b ":$blocksize" -f bbe-script >binary_file_out
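Before running this on a huge file, it may be worth checking the behaviour on a tiny sample. The following is only a sketch under my own assumptions: the names fill, sample_in, sample_expected and sample_out are made up for illustration, and head -c, yes and mktemp -d are assumed to be available (they are common but not strictly POSIX). It builds a three-block file (data, zeros, data), builds the expected result with the zero block dropped, and compares the two only if bbe is installed.

```shell
blocksize=512
workdir=$(mktemp -d)

# fill N CHAR: write N copies of the printable character CHAR (hypothetical helper)
fill() { head -c "$1" /dev/zero | tr '\0' "$2"; }

# Sample input: one block of 'A', one block of zero bytes, one block of 'B'.
{ fill "$blocksize" A; head -c "$blocksize" /dev/zero; fill "$blocksize" B; } > "$workdir/sample_in"

# What the output should look like if the all-zero block is stripped.
{ fill "$blocksize" A; fill "$blocksize" B; } > "$workdir/sample_expected"

if command -v bbe >/dev/null 2>&1; then
   # Same script format as above: s/ followed by $blocksize times \0, then //
   { printf 's/'; yes '\0' | head -n "$blocksize" | tr -d '\n'; printf '//\n'; } > "$workdir/bbe-script"
   <"$workdir/sample_in" bbe -b ":$blocksize" -f "$workdir/bbe-script" >"$workdir/sample_out"
   cmp "$workdir/sample_out" "$workdir/sample_expected" && echo "zero block removed as expected"
else
   echo "bbe not installed; skipping the comparison"
fi
```

If cmp reports a difference for the block size you intend to use, there is little point in starting the long run.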

Problems:

  1. The above implementation of gen_script is pretty slow, rather impractical for large blocksize.
  2. In my tests bbe misbehaved for blocksize greater than 16384 (i.e. blocks of 16 KiB). This makes the first problem irrelevant.
  3. In this role bbe itself does not seem very fast either. I don't know how large your "very large file" is. If I were you, I would try

    pv binary_file_in | bbe -b ":$blocksize" -f bbe-script >binary_file_out
    

    and after a few seconds I would be able to tell whether the ETA is acceptable.
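Regarding the first problem: the shell loop is the slow part, so one possible workaround is to let yes, head and tr produce the repeated \0 instead. This is a sketch, not something I have benchmarked; gen_script_fast is a name I made up, and it assumes yes is available (it is on GNU and BSD systems, although it is not required by POSIX).

```shell
# gen_script_fast N: print a bbe substitute command that deletes N consecutive null bytes.
# Hypothetical faster variant of gen_script; it emits the same text.
gen_script_fast() {
   printf 's/'
   # 'yes' repeats the two-character string \0 on separate lines,
   # 'head' keeps the first N lines, 'tr' joins them by deleting the newlines.
   yes '\0' | head -n "$1" | tr -d '\n'
   printf '//\n'
}
```

It is used the same way, e.g. gen_script_fast "$blocksize" > bbe-script. Note this only speeds up script generation; it does nothing about the second and third problems.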