0

I have a file made up of multiple TIFF files concatenated together (note that this is not a multipage TIFF). I am looking for a way to split it back into separate files, preferably from the command line so that the process can be automated.

I may be oversimplifying, but each image appears to start with the hex bytes 49 49 2A. I did some searching and tried various suggestions for splitting binary files with awk and split, but haven't been able to get any of them to work for my situation.

Is there some other method I could use to get this to work?

Thomas W.
matthew

5 Answers

2

If you're sure the concatenated TIFFs are all little-endian files (49 49 2A 00 magic number), then this Perl script should work. Invoke as perl foo.pl < file.tif

#!/usr/bin/env perl
use strict;
use warnings;

# Magic numbers as raw bytes, plus regex forms with the '*' escaped.
my $big_endian = "MM\0*";
my $big_endian_regex = "MM\0\\*";
my $little_endian = "II*\0";
my $little_endian_regex = "II\\*\0";

# Switch these two to the big-endian pair for "MM" files.
my $tiff_magic = $little_endian;
my $tiff_magic_regex = $little_endian_regex;

my $n = 0;
my $fileprefix = "chunk";
my $buffer;

# Slurp all of standard input as raw bytes.
binmode STDIN;
{ local $/ = undef; $buffer = <STDIN>; }

# split drops the magic number itself, so it is written back below.
my @images = split /${tiff_magic_regex}/, $buffer;

for my $image (@images) {
    next if $image eq '';
    my $file = sprintf("$fileprefix.%02d.tif", $n++);
    open my $fh, ">", $file or die "open $file: $!";
    binmode $fh;
    print $fh $tiff_magic, $image or die "print $file: $!";
    close $fh or die "close $file: $!";
}

exit 0;
Kyle Jones
1

I was playing around and came up with another way of looking for an arbitrary sequence of hex bytes in a file, so I thought I'd add a second answer. Assume you have multiple TIFFs concatenated together in a single file called manyTIFs, you can do this with just xxd and awk:

#!/bin/bash

# Dump the concatenated TIFFs, one byte per line, so line number is byte offset

xxd -c1 manyTIFs | awk '
   BEGIN{
      sa[0]="49"   # sa = sought array. It contains the bytes we are seeking
      sa[1]="49"
      sa[2]="2a"
      sa[3]="00"
      si=0         # seek index, which item in "sa" we are looking for
   }
   {
      byte=$2      # Pick up the hex byte, it is the second field, i.e. after the offset
      if(byte==sa[si]){       # if it is the one we are looking for
         si++                 # look for next byte
         if(si==4){           # check if we have found all 4 bytes
            si=0              # restart the search
            print NR-4        # TIFF file started 4 bytes back
         }
      } else if(byte==sa[0]){ # mismatch, but this byte may itself begin a match
         si=(si==2)?2:1       # after "49 49", a further "49" keeps two bytes matched
      } else {
         si=0                 # restart the search
      }
   }
'

It prints out the byte offsets where each TIFF begins - I'll leave it as an exercise to feed that into dd to do the actual cutting.

In case you want to see what that xxd command outputs, it is like this, which is why the awk looks at column 2:

Sample Output

00000000: 49  I
00000001: 49  I
00000002: 2a  *
00000003: 00  .
00000004: 10  .
00000005: 6c  l
00000006: 04  .
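For the cutting step left as an exercise above, here is a minimal sketch rather than a definitive implementation. The function name split_at_offsets and the part.NN.tif output names are made up for illustration, and it uses tail and head in place of dd because they take byte positions and counts directly. It assumes the offsets are passed in ascending order.

```shell
# split_at_offsets FILE OFFSET...
# Hypothetical helper: writes part.00.tif, part.01.tif, ... where each piece
# runs from one offset up to the next (the last piece runs to end of file).
split_at_offsets() {
    file=$1; shift
    size=$(wc -c < "$file" | tr -d ' ')   # total size in bytes
    n=0
    while [ "$#" -gt 0 ]; do
        start=$1; shift
        end=${1:-$size}                   # next offset, or end of file
        out=$(printf 'part.%02d.tif' "$n")
        # tail -c +K starts at byte K (1-based); head -c keeps the piece length
        tail -c +"$((start + 1))" "$file" | head -c "$((end - start))" > "$out"
        n=$((n + 1))
    done
}

# Example call, with offsets taken from the script above:
#   split_at_offsets manyTIFs 0 1843480
```

Since the search matches the four magic bytes anywhere, including inside pixel data, it is worth checking that each piece actually opens as an image.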
0

GNU Parallel is able to split files based on a record-start and record-end string, so if you had many TIFFs all concatenated together end-to-end in a single file called manyTIFs, you could do:

parallel --recstart "II*" --pipepart -a manyTIFs 'cat > {#}.tif'

GNU Parallel pipes the divided up parts into the script in single quotes at the right end. The {#} gets replaced by the job number which is just an incrementing integer, so your files will come out named 1.tif, 2.tif and so on.

0

As nobody was interested enough to upvote my other answers, I thought I'd add a third, completely different method:

First, make some images in TIFF big-endian, TIFF little-endian, JPG, PNG, GIF formats with ImageMagick for testing:

magick -size 640x480 xc:red image.gif
magick -size 640x480 xc:red image.jpg
magick -size 640x480 xc:red image.png
magick -size 640x480 xc:red -define tiff:endian=msb imageMSB.tif
magick -size 640x480 xc:red -define tiff:endian=lsb imageLSB.tif

Then concatenate them all together into a big, amorphous blob and check what we have got:

cat image* > blob

ls -l image* blob

-rw-r--r-- 1 root root 3692113 Oct 13 09:27 blob
-rw-r--r-- 1 root root     903 Oct 13 09:25 image.gif
-rw-r--r-- 1 root root    3888 Oct 13 09:25 image.jpg
-rw-r--r-- 1 root root     362 Oct 13 09:27 image.png
-rw-r--r-- 1 root root 1843480 Oct 13 09:26 imageLSB.tif
-rw-r--r-- 1 root root 1843480 Oct 13 09:26 imageMSB.tif

And now the answer, which uses binwalk and shows you the byte offsets of all the files in both hex and decimal and which you can use with awk and dd to separate out your files - all with the correct extensions:

binwalk blob

DECIMAL       HEXADECIMAL     DESCRIPTION
0             0x0             GIF image data, version "89a", 640 x 480
903           0x387           JPEG image data, JFIF standard 1.01
4791          0x12B7          PNG image, 640 x 480, 1-bit colormap, non-interlaced
4926          0x133E          Zlib compressed data, best compression
5153          0x1421          TIFF image data, little-endian offset of first image directory: 1843208
1848633       0x1C3539        TIFF image data, big-endian, offset of first image directory: 1843208
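The awk and dd step mentioned above might look something like this. It is only a sketch: binwalk_offsets and carve are hypothetical helper names, the extension mapping covers just the four image types in this test, and tail and head stand in for dd. It assumes the report has the three-column layout shown above, and that each file runs up to the start of the next recognised one.

```shell
# binwalk_offsets: hypothetical helper. Reads a binwalk text report on stdin
# and prints "OFFSET EXT" for each image type it recognises.
binwalk_offsets() {
    awk '
        $1 ~ /^[0-9]+$/ {                  # data rows start with a decimal offset
            ext = ""
            if      ($0 ~ /GIF image/)  ext = "gif"
            else if ($0 ~ /JPEG image/) ext = "jpg"
            else if ($0 ~ /PNG image/)  ext = "png"
            else if ($0 ~ /TIFF image/) ext = "tif"
            if (ext != "") print $1, ext   # skip embedded streams such as Zlib
        }'
}

# carve FILE: hypothetical helper. Reads "OFFSET EXT" pairs on stdin and
# writes carved.0.EXT, carved.1.EXT, ... cut from FILE.
carve() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    prev_off=""; prev_ext=""; n=0
    # Append a sentinel row so the last piece ends at end of file.
    { cat; echo "$size end"; } | while read -r off ext; do
        if [ -n "$prev_off" ]; then
            tail -c +"$((prev_off + 1))" "$file" |
                head -c "$((off - prev_off))" > "carved.$n.$prev_ext"
            n=$((n + 1))
        fi
        prev_off=$off; prev_ext=$ext
    done
}

# Example call:
#   binwalk blob | binwalk_offsets | carve blob
```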


Note that you can more simply extract the files with binwalk itself, using:

binwalk -e BIGBLOB.BIN

For anyone who doesn't trust or care to install binwalk, just start a docker alpine image with the current directory on the host mapped to /work in the container with:

docker run -it -v "$(pwd)":/work -w /work alpine:latest

Then, inside the container run:

echo "https://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
apk update && apk add binwalk
0

In a TIFF file, the first 2 bytes are ASCII characters, either "II" or "MM", giving the byte order (Intel or Motorola). They are followed by a 2-byte word holding the version, which should be decimal 42 (don't panic).

see for instance: http://www.fileformat.info/format/tiff/corion.htm

In your example, the bytes 49 49 are "II" (Intel byte order), and 2A is hexadecimal for the version number 42.
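To check which of the two headers a given file starts with, something along these lines should do. It is a sketch, and tiff_byte_order is just a name I made up; it assumes head -c and od are available.

```shell
# tiff_byte_order FILE
# Hypothetical helper: reads the first 4 bytes and reports the byte order.
tiff_byte_order() {
    # Hex-dump the 4-byte header and squeeze it into one string, e.g. 49492a00
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case $magic in
        49492a00) echo "little-endian (II), version 42" ;;   # 2a 00 = 42, low byte first
        4d4d002a) echo "big-endian (MM), version 42" ;;      # 00 2a = 42, high byte first
        *)        echo "not a TIFF header: $magic"; return 1 ;;
    esac
}

# Example call:
#   tiff_byte_order chunk.00.tif
```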

horatio