3

I want to checksum large files and streams in unix/linux, and I want to get many checksums, one for each large part of the file/stream, every 1 MB or every 10 MB.

For example, I have a disk image, a compressed disk image, and a copy of the original disk. Some parts of the images may be modified. The disk is 50 GB, so there are around 50 000 blocks of 1 MB. So for every file I want to get 50 000 md5sums or sha1sums to get an overview of the modifications. A single md5sum will not help me locate the modification offset.

This task is easy for the uncompressed disk image, using the dd tool in a bash for loop, computing offsets and selecting (skipping) every 1 MB part of the file. The same works for the disk:

for a in `seq 0 49999`; do echo -n "$a: "; dd if=image.src bs=1M count=1 skip=$a 2>/dev/null | md5sum; done

But now I want to compare the compressed image and the uncompressed one without unpacking it to disk. I have the 7z unpacker, which can unpack the image to stdout at high speed, up to 150-200 MB/s (run as 7z e -so image.7z |). But what can I write after the | symbol to get the md5sums of all the file parts?

osgx
  • 7,017

6 Answers

9

split from GNU coreutils (installed by default on most Linux distributions) has a --filter option which you can use:

7z e -so image.7z | split -b 1000000 --filter=md5sum
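If you also want to know which chunk each checksum belongs to, GNU split sets the $FILE environment variable for the filter command. A sketch like this (the chunk_ prefix and suffix width are arbitrary, and -b 1M means 1 MiB rather than the decimal 1000000 above) labels every sum with a sequential chunk name:

7z e -so image.7z | split -b 1M -d -a 5 --filter='printf "%s " "$FILE"; md5sum' - chunk_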
2

Something simple like this Perl script probably would suffice.

my ($buffer, $amount) = (undef, 1_000_000);
binmode STDIN;                                      # read raw bytes, not text
while (read(STDIN, $buffer, $amount) > 0) {
    open(my $md5, '|-', 'md5sum') or die "md5sum: $!";   # use "md5" on BSD systems
    print $md5 $buffer;
    close $md5;
}

Put this in foo.pl and invoke it as perl foo.pl at the end of your pipeline.
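For example, with the compressed image from the question:

7z e -so image.7z | perl foo.pl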

Kyle Jones
  • 6,364
0

It seems to me that you're looking for this kind of tool.

From the README file of Bigsync:

Bigsync is a tool to incrementally backup a single large file to a slow destination (think network media or a cheap NAS). The most common cases for bigsync are disk images, virtual OSes, encrypted volumes and raw devices.

Bigsync will read the source file in chunks calculating checksums for each one. It will compare them with previously stored values for the destination file and overwrite changed chunks if checksums differ.

This way we minimize the access to a slow target media which is the whole point of bigsync's existence.
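The same idea can be sketched in plain shell for the files from the question, using split --filter as in the answer above (the .md5list names are just examples, and this only locates changed blocks rather than updating anything the way bigsync does):

7z e -so image.7z | split -b 1M --filter=md5sum > image.7z.md5list
split -b 1M --filter=md5sum < image.src > image.src.md5list
diff image.7z.md5list image.src.md5list    # a differing line number is the 1-based number of a changed 1 MB block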

Luis
  • 752
0

It was easy to write a small 1 MB hasher using the rhash tools (librhash library). Here is a simple Perl script which computes checksums of each 1 MB part of the standard input stream. It needs the Crypt::Rhash bindings from CPAN:

$ cpan
(cpan) install Crypt::Rhash
$ cat rhash1M.pl
#!/usr/bin/perl
# Compute md5 and sha1 sum of every 1 MB part of stream

use strict;
use local::lib;
use Crypt::Rhash;

my ($buf, $len, $i);
my $r = Crypt::Rhash->new(RHASH_MD5 | RHASH_SHA1);
# we can add more hashes, like RHASH_TIGER etc.
binmode STDIN;
$i = 0;
while ($len = read STDIN, $buf, 1024*1024) {
    print "$i+$len: \t";    # print the offset and block length
    $r->update($buf);
    print "md5:", $r->hash(RHASH_MD5), " sha1:", $r->hash(RHASH_SHA1), "\n";
    $r->reset();            # reset the hash calculator for the next block
    $i += $len;
}

This public domain script will output the decimal offset, then +, then the block size, then the md5 and sha1 sums of that block of input.

For example, 2 MB of zeroes has these sums:

$ dd if=/dev/zero of=zerofile bs=1M count=2
$ ./rhash1M.pl < zerofile 
0+1048576:  md5:b6d81b360a5672d80c27430f39153e2c sha1:3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3 
1048576+1048576:    md5:b6d81b360a5672d80c27430f39153e2c sha1:3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
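To answer the original question, the unpacker output can be piped straight into the script, and the same listing produced for the uncompressed copy:

7z e -so image.7z | ./rhash1M.pl
./rhash1M.pl < image.src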
osgx
  • 7,017
0

rsync works like this, computing a checksum to see if there are differences in parts of the file before sending anything.

I'm not sure how well it would function with files this large, although I have never heard of it having any file size limitation.
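Note that for two local paths rsync defaults to --whole-file, which skips the delta algorithm entirely, so to get the block-checksum behaviour described above on a local copy you have to disable that explicitly. A sketch, with a hypothetical destination path:

rsync --no-whole-file --inplace --progress image.src /mnt/backup/image.src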

localhost
  • 695
0

Pipe the output to this Python 2 script, for example 7z e -so image.7z | python md5sum.py:

import sys, hashlib
CHUNK_SIZE = 1000 * 1000   # 1 MB (decimal); use 1024 * 1024 for 1 MiB blocks
# read stdin in fixed-size chunks until EOF, printing one md5 per chunk
for chunk in iter(lambda: sys.stdin.read(CHUNK_SIZE), ''):
    print hashlib.new('md5', chunk).hexdigest()
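The script prints one hash per chunk with no offsets; piping the output through nl numbers the blocks (starting at 1), so a changed block can still be located:

7z e -so image.7z | python md5sum.py | nl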