Split concatenated tiff file

Question

I have a file that consists of multiple TIFF files concatenated together (note that this is not a multipage TIFF). I am looking for a way to split the file back into separate files, preferably from the command line so that the process can be automated.

I may be oversimplifying, but each image appears to start with the hex values 49 49 2A. I searched around and tried various suggestions for splitting binary files with awk and split, but I haven't been able to get any of them to work for my situation.

Is there some other method I could use to get this to work?

score 2 · Accepted Answer · answered Mar 27 '12 at 20:57

If you're sure the concatenated TIFFs are all little-endian files (49 49 2A 00 magic number), then this Perl script should work. Invoke as perl foo.pl < file.tif

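#!/usr/bin/env perl
use strict;
use warnings;

# Magic numbers for both byte orders; the regex forms escape the literal "*".
my $big_endian = "MM\0*";
my $big_endian_regex = "MM\0\\*";
my $little_endian = "II*\0";
my $little_endian_regex = "II\\*\0";

# Select the byte order to split on (little-endian here).
my $tiff_magic = $little_endian;
my $tiff_magic_regex = $little_endian_regex;

my $n = 0;
my $fileprefix = "chunk";
my $buffer;

# Slurp all of standard input as raw bytes.
binmode STDIN;
{ local $/ = undef; $buffer = <STDIN>; }

# split drops the delimiter, so the magic number is written back below.
my @images = split /${tiff_magic_regex}/, $buffer;

for my $image (@images) {
    next if $image eq '';
    my $file = sprintf("$fileprefix.%02d.tif", $n++);
    open my $fh, ">", $file or die "open $file: $!";
    binmode $fh;
    print $fh $tiff_magic, $image or die "print $file: $!";
    close $fh or die "close $file: $!";
}

exit 0;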

score 1 · Answer 2 · answered Jan 19 '21 at 18:47

I was playing around and came up with another way of looking for an arbitrary sequence of hex bytes in a file, so I thought I'd add a second answer. Assuming you have multiple TIFFs concatenated together in a single file called manyTIFs, you can do this with just xxd and awk:

#!/bin/bash
# Dump the concatenated TIFFs, one byte per line, so line number is byte offset
xxd -c1 manyTIFs | awk '
   BEGIN{
      sa[0]="49"          # sa = sought array. It contains the bytes we are seeking
      sa[1]="49"
      sa[2]="2a"
      sa[3]="00"
      si=0                # seek index, which item in "sa" we are looking for
   }
   { 
      byte=$2             # Pick up the hex byte, it is the second field, i.e. after the offset
      if(byte==sa[si]){   # if it's the one we are looking for
         si++             # look for next byte
         if(si==4){       # check if we have found all 4 bytes
            si=0          # restart the search
            print NR-4    # TIFF file started 4 bytes back
         }
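      } else if(byte==sa[0]){
         si=(si==2)?2:1   # a mismatching "49" may itself start, or extend, a new match
      } else {
         si=0             # restart the search
      }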
   }
'

It prints out the byte offsets where each TIFF begins - I'll leave it as an exercise to feed that into dd to do the actual cutting.
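The cutting step left as an exercise could also be done without dd. Below is a minimal Python sketch, assuming every occurrence of the little-endian magic number 49 49 2A 00 marks the start of a new image (this can be wrong if the same four bytes happen to occur inside pixel data). The function names and the chunk.NN.tif naming are illustrative choices, not something taken from the answer above.

```python
from pathlib import Path
from typing import List

LITTLE_ENDIAN_MAGIC = b"II*\x00"  # 49 49 2A 00

def find_offsets(data: bytes, magic: bytes = LITTLE_ENDIAN_MAGIC) -> List[int]:
    """Return every byte offset at which `magic` occurs in `data`."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets

def cut_at_offsets(data: bytes, offsets: List[int]) -> List[bytes]:
    """Slice `data` into chunks, each running from one offset to the next."""
    bounds = sorted(offsets) + [len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]

def split_file(path: str, prefix: str = "chunk") -> List[str]:
    """Write each chunk of `path` to prefix.NN.tif and return the file names."""
    data = Path(path).read_bytes()
    names = []
    for n, chunk in enumerate(cut_at_offsets(data, find_offsets(data))):
        name = f"{prefix}.{n:02d}.tif"  # hypothetical naming scheme
        Path(name).write_bytes(chunk)
        names.append(name)
    return names
```

Calling split_file("manyTIFs") would write one file per detected offset. Any bytes before the first magic number are discarded, since no chunk starts there.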

In case you want to see what that xxd command outputs, it is like this, which is why the awk looks at column 2:

Sample Output

00000000: 49  I
00000001: 49  I
00000002: 2a  *
00000003: 00  .
00000004: 10  .
00000005: 6c  l
00000006: 04  .

score 0 · Answer 3 · answered Jan 19 '21 at 16:05

GNU Parallel is able to split files based on a record-start and record-end string, so if you had many TIFFs all concatenated together end-to-end in a single file called manyTIFs, you could do:

parallel --recstart "II*" --pipepart -a manyTIFs 'cat > {#}.tif'

GNU Parallel pipes the divided up parts into the script in single quotes at the right end. The {#} gets replaced by the job number which is just an incrementing integer, so your files will come out named 1.tif, 2.tif and so on.

Mark Setchell · Answer 4 · 2024-04-29T17:37:19.790

As nobody was interested enough to upvote my other answers, I thought I'd add a third, completely different method:

First, make some images in TIFF big-endian, TIFF little-endian, JPG, PNG, GIF formats with ImageMagick for testing:

magick -size 640x480 xc:red image.gif
magick -size 640x480 xc:red image.jpg
magick -size 640x480 xc:red image.png
magick -size 640x480 xc:red -define tiff:endian=msb imageMSB.tif
magick -size 640x480 xc:red -define tiff:endian=lsb imageLSB.tif

Then concatenate them all together into a big, amorphous blob and check what we have got:

cat image* > blob
ls -l image* blob
-rw-r--r--    1 root     root       3692113 Oct 13 09:27 blob
-rw-r--r--    1 root     root           903 Oct 13 09:25 image.gif
-rw-r--r--    1 root     root          3888 Oct 13 09:25 image.jpg
-rw-r--r--    1 root     root           362 Oct 13 09:27 image.png
-rw-r--r--    1 root     root       1843480 Oct 13 09:26 imageLSB.tif
-rw-r--r--    1 root     root       1843480 Oct 13 09:26 imageMSB.tif

And now the answer: binwalk lists the byte offset of every embedded file in both decimal and hex, and you can feed those offsets to awk and dd to separate out your files, all with the correct extensions:

binwalk blob
DECIMAL       HEXADECIMAL     DESCRIPTION
0             0x0             GIF image data, version "89a", 640 x 480
903           0x387           JPEG image data, JFIF standard 1.01
4791          0x12B7          PNG image, 640 x 480, 1-bit colormap, non-interlaced
4926          0x133E          Zlib compressed data, best compression
5153          0x1421          TIFF image data, little-endian offset of first image directory: 1843208
1848633       0x1C3539        TIFF image data, big-endian, offset of first image directory: 1843208
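
As a rough illustration of turning that table into something you can cut with, here is a Python sketch. It assumes the three-column layout shown above, assumes each image runs up to the start of the next one, and skips rows such as the Zlib stream, which lies inside the PNG rather than starting a new file. The extension mapping and the function names are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Sample binwalk output copied from the listing above.
SAMPLE_OUTPUT = '''DECIMAL       HEXADECIMAL     DESCRIPTION
0             0x0             GIF image data, version "89a", 640 x 480
903           0x387           JPEG image data, JFIF standard 1.01
4791          0x12B7          PNG image, 640 x 480, 1-bit colormap, non-interlaced
4926          0x133E          Zlib compressed data, best compression
5153          0x1421          TIFF image data, little-endian offset of first image directory: 1843208
1848633       0x1C3539        TIFF image data, big-endian, offset of first image directory: 1843208
'''

# Assumed mapping from the start of a description to a file extension.
EXTENSIONS = {"GIF image": "gif", "JPEG image": "jpg",
              "PNG image": "png", "TIFF image": "tif"}

ROW = re.compile(r"^(\d+)\s+0x[0-9A-Fa-f]+\s+(.*)$")

def parse_binwalk(text: str) -> List[Tuple[int, str]]:
    """Return (offset, extension) for each row describing a known image type."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header or blank line
        offset, description = int(match.group(1)), match.group(2)
        for key, ext in EXTENSIONS.items():
            if description.startswith(key):
                entries.append((offset, ext))
                break  # rows with no known prefix (e.g. Zlib) are skipped
    return entries

def carve_plan(entries: List[Tuple[int, str]], total_size: int) -> List[Tuple[int, int, str]]:
    """Return (offset, length, extension), each image ending where the next begins."""
    starts = [offset for offset, _ in entries] + [total_size]
    return [(offset, starts[i + 1] - offset, ext)
            for i, (offset, ext) in enumerate(entries)]
```

With the blob size of 3692113 bytes from the ls listing, carve_plan(parse_binwalk(SAMPLE_OUTPUT), 3692113) gives lengths of 903, 3888, 362, 1843480 and 1843480 bytes, which match the sizes of the original files.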

Note that you can more simply extract the files with binwalk itself, using:

binwalk -e blob

If you don't trust binwalk or don't care to install it, start a Docker Alpine container with the host's current directory mapped to /work:

docker run -it -v "$(pwd)":/work -w /work alpine:latest

Then, inside the container run:

echo "https://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
apk update && apk add binwalk

score 0 · Answer 5 · answered Mar 27 '12 at 20:14

I know that in a TIFF file the first 2 bytes are ASCII characters, either "II" or "MM", which give the byte order (Intel or Motorola). They are followed by a 2-byte word for the version, which should be decimal 42 (don't panic).

See for instance: http://www.fileformat.info/format/tiff/corion.htm

In your example, the bytes 49 49 2A are "II" (Intel byte order) followed by 0x2A, which is 42 in decimal, the version number.
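
To check a candidate offset against that layout, something like the following Python sketch could be used. It reads the byte-order mark and the version word described above, plus the 4-byte offset of the first image directory that follows them in the 8-byte TIFF header. The function name and error messages are made up for this example.

```python
import struct
from typing import Tuple

def parse_tiff_header(header: bytes) -> Tuple[str, int, int]:
    """Return (byte_order, version, first_ifd_offset) or raise ValueError."""
    if len(header) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order = header[:2]
    if order == b"II":
        fmt = "<HI"  # Intel: little-endian 16-bit version, 32-bit offset
    elif order == b"MM":
        fmt = ">HI"  # Motorola: big-endian
    else:
        raise ValueError("not a TIFF byte-order mark: %r" % order)
    version, ifd_offset = struct.unpack(fmt, header[2:8])
    if version != 42:
        raise ValueError("unexpected TIFF version: %d" % version)
    return order.decode("ascii"), version, ifd_offset
```

Running this on the 8 bytes at each candidate offset is one way to reject places where 49 49 2A 00 merely occurs inside image data, although a plausible-looking header can still occur there by chance.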
