How to compare huge sparse files efficiently?

Question

There are two sparse files. diff proves them identical, but the comparison took 20 minutes, which is too long. I am thinking of tarring them into tiny archives to speed up the comparison, but they tar into different outputs.

They are huge 512 GB sparse files, each with only around 40K of meaningful data.

% ls -l sparse_file_one/
total 40
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
% ls -l sparse_file_two/
total 48
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img
% du sparse_file_one/sdd.img
40      sparse_file_one/sdd.img
% du sparse_file_two/sdd.img 
48      sparse_file_two/sdd.img

The diff comparison takes 20 minutes and reports the files as identical.

% diff -qs --speed-large-files sparse_file_one/sdd.img sparse_file_two/sdd.img | pv
68.0 B 0:20:57 [55.4miB/s] [     <=>                                                     ]
Files sparse_file_one/sdd.img and sparse_file_two/sdd.img are identical

Since their du disk usages differ, I checked with filefrag, which confirms that their extent layouts differ.

% filefrag -v sparse_file_one/sdd.img
Filesystem type is: ef53
File size of sparse_file_one/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6866944..   6866944:      1:            
   1:     8192..    8194:    6852608..   6852610:      3:    6875136:
   2:    12288..   12288:    6854656..   6854656:      1:    6856704:
   3:    16384..   16384:    6868992..   6868992:      1:    6858752:
   4:    16448..   16449:    6869056..   6869057:      2:            
   5:    16512..   16512:    6869120..   6869120:      1:             last
sparse_file_one/sdd.img: 4 extents found
% filefrag -v sparse_file_two/sdd.img
Filesystem type is: ef53
File size of sparse_file_two/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6871040..   6871040:      1:
   1:     8192..    8195:    6856704..   6856707:      4:    6879232:
   2:    12288..   12288:    6858752..   6858752:      1:    6860800:
   3:    16384..   16384:    6860800..   6860800:      1:    6862848:
   4:    16448..   16449:    6860864..   6860865:      2:
   5:    16512..   16512:    6860928..   6860928:      1:
   6: 125173759..125173759:  132128862.. 132128862:      1:  132018175: last,eof
sparse_file_two/sdd.img: 5 extents found

tar completes promptly; it takes practically no time. But the tar output sizes differ, so it is no wonder that the archives do not compare as identical.

% cd ../sparse_file_one/
sparse_file_one % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img
8.00 B 0:00:00 [26.2KiB/s] [  <=>                                              ]
sparse_file_one % ls -l
total 80
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
-rw-r--r-- 1 midnite midnite        40960 Mar  5 00:22 sdd.img.tar
% cd ../sparse_file_two
sparse_file_two % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img
8.00 B 0:00:00 [ 520KiB/s] [  <=>                                              ]
sparse_file_two % ls -l
total 100
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img
-rw-r--r-- 1 midnite midnite        51200 Mar  5 00:23 sdd.img.tar

(With reference to this post, zeroing the mtime yields identical tar archives. I was able to make identical archives from other identical sparse or non-sparse files, but this behaviour is apparently not guaranteed.)

(According to this post, if I could extract the content of a sparse file in less than 10 minutes, verifying that the files are identical would be faster. But I do not know Python. It would be nice if some native Linux program could do it.)

PS: I would prefer diff to cmp, because diff can compare directories recursively.

Arthur Moraes Do Lago · Answer 1 · 2022-10-19T16:09:26.910

I think this tool I made might be useful to you: https://github.com/ArthurMLago/sparsediff

I had to compare huge sparse files too: 60 GB apparent size, a couple hundred KB of real disk usage. I never found a good solution online, so I ended up writing my own application. It uses lseek with SEEK_HOLE and SEEK_DATA to efficiently find the relevant sections of the first file and compares them with the second. The output was inspired by hexdump -C and is intended for binary files.
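
To illustrate the SEEK_DATA / SEEK_HOLE idea, here is a minimal Python sketch. It is not the sparsediff implementation; the function names data_extents and sparse_equal are made up for this example. It assumes a Unix system where os.SEEK_DATA, os.SEEK_HOLE and os.pread are available, and it only answers "identical or not" without printing differences. Regions that are holes in both files read as zeros in both, so only the regions reported as data in either file need to be read.

```python
import errno
import os


def data_extents(fd, size):
    """Yield (start, end) byte ranges that the filesystem reports as data."""
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # no data at or after this offset
                return
            raise
        # The end of the file counts as an implicit hole, so end <= size.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


def sparse_equal(path_a, path_b, chunk=1 << 20):
    """Return True if both files have the same size and the same content."""
    fd_a = os.open(path_a, os.O_RDONLY)
    fd_b = os.open(path_b, os.O_RDONLY)
    try:
        size = os.fstat(fd_a).st_size
        if size != os.fstat(fd_b).st_size:
            return False
        # Walk the data extents of each file in turn. A range that is data
        # in one file and a hole in the other is read as zeros from the hole.
        # Ranges that are data in both files are compared twice, which is
        # wasteful but harmless when the allocated size is small.
        for fd in (fd_a, fd_b):
            for start, end in data_extents(fd, size):
                pos = start
                while pos < end:
                    n = min(chunk, end - pos)
                    if os.pread(fd_a, n, pos) != os.pread(fd_b, n, pos):
                        return False
                    pos += n
        return True
    finally:
        os.close(fd_a)
        os.close(fd_b)
```

Calling sparse_equal("sparse_file_one/sdd.img", "sparse_file_two/sdd.img") would then read only the allocated regions, so the work should scale with the disk usage shown by du rather than with the apparent size. On a filesystem that does not report holes, the whole file is treated as one data extent and the sketch degrades to a full read.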
