
I'm getting a "diff: memory exhausted" error when trying to diff two 27 GB files that are largely similar, on a CentOS 5 Linux box with 4 GB of RAM. This seems to be a known problem.

I would expect there to be an alternative for such an essential utility, but I can't find one. I imagine the solution would have to use temporary files rather than memory to store the information it needs.

  • I tried rdiff and xdelta, but they are designed to produce a binary delta between two files, like a patch, and are not that useful for inspecting the differences.
  • I tried VBinDiff, but it is a visual tool better suited to comparing binary files. I need something that can pipe the differences to STDOUT like regular diff.
  • There are a lot of other utilities such as vimdiff that only work with smaller files.
  • I've also read about Solaris bdiff but I could not find a port for Linux.

Any ideas besides splitting the files into smaller pieces? I have 40 of these files, so I'm trying to avoid the work of breaking them up.

Tom B

6 Answers


cmp does things byte-by-byte, so it probably won't run out of memory (I just tested it on two 7 GB files) -- but you might be looking for more detail than a list of "files X and Y differ at byte x, line y". If the similar parts of your files are at different offsets (e.g., file Y has an identical block of text, but not at the same location), you can pass offsets to cmp; you could probably turn it into a resynchronizing compare with a small script.
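
For example, with GNU cmp (a sketch; FILE1/FILE2 and the skip values are placeholders):

cmp -l FILE1 FILE2                        # list every differing byte (offset plus octal values)
cmp --ignore-initial=1G:2G FILE1 FILE2    # skip 1 GiB of FILE1 and 2 GiB of FILE2 before comparing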

Aside: In case anyone else lands here when looking for a way to confirm that two directory structures (containing very large files) are identical: diff --recursive --brief (or diff -r -q for short, or maybe even diff -rq) will work and not run out of memory.

Felix

I found this link

diff -H (the GNU synonym for --speed-large-files) might help, or you can try installing the textproc/2bsd-diff port, which apparently doesn't try to load the files into RAM, so it can handle large files more easily.
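
For what it's worth, the first suggestion would look something like this with GNU diff (file names are placeholders):

diff --speed-large-files file1 file2 > changes.diff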

I'm not sure if you tried those two options or if they might work for you. Good luck.

Jarvin

If the files are identical in length and differ only in a few byte values, you can use a script like the following (w is the number of bytes per line to hexdump; adjust it to your display width):

w=12;
while read -ru7 x && read -ru8 y;
do
  # print a pair of hexdump lines only when they differ
  [ ".$x" = ".$y" ] || echo "$x | $y";
done 7< <(od -vw$w -tx1z FILE1) 8< <(od -vw$w -tx1z FILE2) > DIFF-FILE1-FILE2 &  # fds 7 and 8 feed the two hexdumps; runs in the background

less DIFF-FILE1-FILE2

It's not very fast, but it does the job.

Tino

If the files have the same number of lines and differ only in the content of a few of them, use the following command. Substitute \a (the alert/BEL character) with any other character that does not occur in either file.

paste -d $'\a' file1 file2 | awk -F$'\a' '$1 != $2'

This works by pairing the lines of the two files and then comparing each pair.
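
A tiny throwaway demonstration (the files and their contents are made up):

printf 'one\ntwo\nthree\n' > a.txt
printf 'one\nTWO\nthree\n' > b.txt
paste -d $'\a' a.txt b.txt | awk -F$'\a' '$1 != $2'
# prints the one differing pair: two^GTWO
# (the BEL separator is invisible; pipe through cat -v to see the ^G)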


So, not exactly the OP's problem, but a related one: you have two large database dumps, each insert/record on its own row, and differences in floating-point implementation produce numbers that are off by some IEEE rounding error. Thanks to the answer provided by @Diomidis, and the sprawling one-line awk script shown below, we get a fully functioning, efficient fuzzy-differ.

Save the text below in some script directory as fuzzy-compare.awk, tune the parameters in the BEGIN section as needed (locale-specific separators, debugging modes, etc.), then pipe the output of paste into it:

paste -d $'\a' file1 file2 | awk -f fuzzy-compare.awk

Sample output:

Line 1 diffs found so far: 1 here at field: 4
75747358        1       53      2011-03-29 23:00:00+00  7.428
75747358        1       53      2011-03-28 23:00:00+00  7.428

Line 2 diffs found so far: 2 here at field: 4
75747359        1       53      2011-03-29 23:30:00+00  5.757
75747359        1       53      2011-03-29 23:30:00+01  5.757

Line 3 diffs found so far: 3 here at field: 3
75747360        1       53      2011-03-30 00:00:00+00  6.739
75747360        1       54      2011-03-30 00:00:00+00  6.74

Line 5 diffs found so far: 4
75747362        1       53      2011-03-30 01:00:00+00  6.736   extra-field
75747362        1       53      2011-03-30 01:00:00+00  6.73599999999999977

With diff showing:

# diff sample.sql sample2.sql
1,3c1,3
< 75747358      1       53      2011-03-29 23:00:00+00  7.428
< 75747359      1       53      2011-03-29 23:30:00+00  5.757
< 75747360      1       53      2011-03-30 00:00:00+00  6.739
---

> 75747358      1       53      2011-03-28 23:00:00+00  7.428
> 75747359      1       53      2011-03-29 23:30:00+01  5.757
> 75747360      1       54      2011-03-30 00:00:00+00  6.74
5,13c5,13
< 75747362      1       53      2011-03-30 01:00:00+00  6.736   extra-field
< 75747363      1       53      2011-03-30 01:30:00+00  7.576
< 75747364      1       53      2011-03-30 02:00:00+00  6.789
< 75747365      1       53      2011-03-30 02:30:00+00  6.386e+2
< 75747366      1       53      2011-03-30 03:00:00+00  6.016E-2
< 75747367      1       53      2011-03-30 03:30:00+00  6.336
< 75747368      1       53      2011-03-30 04:00:00+00  6.1
< 75747374      1       53      2011-03-30 07:00:00+00  5.9412
< 75747375      1       53      2011-03-30 07:30:00+00  6.137803249
---
> 75747362      1       53      2011-03-30 01:00:00+00  6.73599999999999977
> 75747363      1       53      2011-03-30 01:30:00+00  7.576e+10
> 75747364      1       53      2011-03-30 02:00:00+00  6.789e-10
> 75747365      1       53      2011-03-30 02:30:00+00  6.38600000000000012e+2
> 75747366      1       53      2011-03-30 03:00:00+00  6.01600000000000001E-2
> 75747367      1       53      2011-03-30 03:30:00+00  6.3360000000000003
> 75747368      1       53      2011-03-30 04:00:00+00  6.0999999999999993
> 75747374      1       53      2011-03-30 07:00:00+00  5.94099999999999984
> 75747375      1       53      2011-03-30 07:30:00+00  6.13780324900000007

Code below (duplicated to a github gist: https://gist.github.com/otheus/92162e3a764d2697c3272b98b2663a94).

#!/bin/awk -f
## Awk script to compare two SQL (postgres) dumps for which each line of input is a row
## and has been preprocessed by
##   paste -d $'\a' file1 file2
## The BEL symbol is used by this program to quickly split the input
##
## Sometimes, numbers differ by some kind of rounding error / floating-point implementation.
## Ignore that error by subtracting the two values and seeing if they are < maxdiff, where
##   maxdiff = 1 / (10 ^ length-after-decimal-point(shortest-value))
## Consider:
##   4.2 vs 4.19998
## The shortest number is 4.2, with 1 digit after the decimal point, so
## maxdiff = 0.1 and the difference of 0.00002 is ignored.
Notes:

  • d is the global diff counter
  • p is the position / field that first had a difference
  • i is a loop variable, usually the current field
  • L is the array of fields from the current line of the left file
  • R is the array of fields from the current line of the right file
  • clhs is the number of fields in L
  • crhs is the number of fields in R

BEGIN {
  FS="\a";
  DECIMAL_SEP=".";
  FIELD_SEP="\t";    # for postgresql; for mysql, maybe ", "
  MAX_DIFFS=10;
  DEBUG=0;

  # Efficiently fill out our table of maximum tolerances of values
  Maxdiffs[1] = 0.1;
  for (i=2; i<31; ++i) Maxdiffs[i] = Maxdiffs[i-1] / 10;

  p=-1;              # everything starts out fine
}

# if -v start=...., skip until that line
NR < (0 + start) { next }

# When pairs don't match, investigate further...
("" $1) != ("" $2) {
  if (DEBUG>1) print "Line",NR ": Input lines differed somehow. Investigating...";
  p=0;               # p is field# where difference was found; 0 means whole line

  # split each half into tab-delimited fields
  clhs=split($1,L,FIELD_SEP);
  crhs=split($2,R,FIELD_SEP);

  if (clhs == crhs) {
    if (DEBUG>1) print "Line",NR ": Same number of tokens in each line, delimited by '" FIELD_SEP "'";

    ## compare field by field
    p = -1;          # if we don't set p in the loop below, no real differences

    # Compare each field, until a difference is found
    for (i=1; i<=clhs && p<0; ++i) {
      # Hint: force this compare to be string-based
      if (("_" L[i]) != ("_" R[i])) {
        if (DEBUG>1) print "Line",NR ": Field",i,"differs somehow";

        ## They differ... but are they numbers?
        if ( \
          L[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ && \
          R[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ \
        ) {
          # both fields are floating-point numbers, compare loosely

          # strip exponent part
          sub(/[eE].*/,"",L[i]); sub(/[eE].*/,"",R[i]);

          # determine precision of shortest value
          precision=( \
            length(L[i]) < length(R[i]) ? \
            length(L[i]) - index(L[i],DECIMAL_SEP) : \
            length(R[i]) - index(R[i],DECIMAL_SEP) \
          );

          # look up the maxdiff from our table
          maxdiff=Maxdiffs[precision];

          diff=(L[i] - R[i]);
          if (diff > maxdiff || diff < -maxdiff) {
            if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"differing more than",maxdiff;
            p=i;
          }
          else {
            if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"but differed less than",maxdiff;
          }
        }
        else {
          if (DEBUG) print "Line",NR ": Strings or ints differed at",i,"between",L[i],"and",R[i];
          p=i;
        }
      }
      else {
        if (DEBUG) print "Line",NR ": No differences found";
      }
    }
  }
  # else, field count is different, so whole line is.
  else {
    if (DEBUG) print "Line",NR ": Number of fields in line differ";
  }
}

p>=0 {
  ++d;               # bump total diffs count
  # Output a little header for each non-matching record
  print "Line",NR,"diffs found so far:",d,(p ? "here at field: " p : "");
  # Output the lines that didn't match
  print $1; print $2; print "";
  p=-1;
}

# Progress counter
NR % 100000 == 0 { print "Line",NR }

d > MAX_DIFFS { exit(1); }

Note: the above code was a one-liner prior to publication.

Otheus

This may not work for all types of files, but if your files have a regular structure, you may be able to split them into smaller chunks and diff the chunks individually.

For example:

csplit large-file.txt '/separator pattern/' '{*}'

Caveat: this only works if your files contain something you can use as a separator without producing hundreds of small files, and only if the resulting chunks are still comparable.
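
A minimal sketch of the whole workflow, assuming GNU csplit and a hypothetical '^-- Table:' separator present in both dumps:

mkdir -p a b
( cd a && csplit -s ../file1.sql '/^-- Table:/' '{*}' )
( cd b && csplit -s ../file2.sql '/^-- Table:/' '{*}' )
# csplit names the chunks xx00, xx01, ...; diff the corresponding pairs
for f in a/xx*; do
  diff -u "$f" "b/${f#a/}"
done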

b1tw153