I have two sparse files. diff proves them identical, but the comparison took 20 minutes, which is too long. I am thinking of tarring them into tiny archives to speed up the comparison, but they tar into different outputs.

They are huge 512 GB sparse files, each with only around 40K of meaningful data.

% ls -l sparse_file_one/
total 40
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
% ls -l sparse_file_two/
total 48
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img

% du sparse_file_one/sdd.img
40      sparse_file_one/sdd.img
% du sparse_file_two/sdd.img
48      sparse_file_two/sdd.img

The diff comparison takes 20 minutes and reports the files as identical.

% diff -qs --speed-large-files sparse_file_one/sdd.img sparse_file_two/sdd.img | pv
68.0 B 0:20:57 [55.4miB/s] [     <=>                                                     ]
Files sparse_file_one/sdd.img and sparse_file_two/sdd.img are identical

As their du disk usages differ, I looked at both files with filefrag, which confirms that their on-disk layouts differ.

% filefrag -v sparse_file_one/sdd.img
Filesystem type is: ef53
File size of sparse_file_one/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6866944..   6866944:      1:            
   1:     8192..    8194:    6852608..   6852610:      3:    6875136:
   2:    12288..   12288:    6854656..   6854656:      1:    6856704:
   3:    16384..   16384:    6868992..   6868992:      1:    6858752:
   4:    16448..   16449:    6869056..   6869057:      2:            
   5:    16512..   16512:    6869120..   6869120:      1:             last
sparse_file_one/sdd.img: 4 extents found

% filefrag -v sparse_file_two/sdd.img
Filesystem type is: ef53
File size of sparse_file_two/sdd.img is 512711720960 (125173760 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    6871040..   6871040:      1:            
   1:     8192..    8195:    6856704..   6856707:      4:    6879232:
   2:    12288..   12288:    6858752..   6858752:      1:    6860800:
   3:    16384..   16384:    6860800..   6860800:      1:    6862848:
   4:    16448..   16449:    6860864..   6860865:      2:            
   5:    16512..   16512:    6860928..   6860928:      1:            
   6: 125173759..125173759:  132128862.. 132128862:      1:  132018175: last,eof
sparse_file_two/sdd.img: 5 extents found
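
The data ranges that filefrag shows can also be queried from a script through lseek with SEEK_DATA and SEEK_HOLE. What follows is only a minimal Python sketch, assuming Linux and a filesystem that reports holes; the helper name data_segments is hypothetical. It prints byte offsets rather than filefrag's block numbers, and on a filesystem without hole support the whole file shows up as one data range.

```python
import errno
import os
import sys


def data_segments(path):
    """Return a list of (start, end) byte ranges the filesystem reports as data."""
    segments = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte that is not inside a hole.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # only a hole remains up to end of file
                raise
            # Find where this data range ends (EOF counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            segments.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return segments


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path)
        for start, end in data_segments(path):
            print(f"  data {start}..{end - 1} ({end - start} bytes)")
```

Saved under an assumed name such as segments.py, it could be run as python3 segments.py sparse_file_one/sdd.img sparse_file_two/sdd.img. An explicitly written block of zeros is reported as data, so two files with identical content can still produce different lists.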

tar completes promptly; it takes literally no time. But the tar output sizes differ, so it is no wonder that the archives do not compare as identical.

% cd ../sparse_file_one/

sparse_file_one % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img
8.00 B 0:00:00 [26.2KiB/s] [ <=> ]

sparse_file_one % ls -l
total 80
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:12 sdd.img
-rw-r--r-- 1 midnite midnite        40960 Mar  5 00:22 sdd.img.tar

% cd ../sparse_file_two

sparse_file_two % tar -cvSf sdd.img.tar --mtime=@0 sdd.img | pv
tar: Option --mtime: Treating date '@0' as 1970-01-01 08:00:00
sdd.img
8.00 B 0:00:00 [ 520KiB/s] [ <=> ]

sparse_file_two % ls -l
total 100
-rw-r--r-- 1 midnite midnite 512711720960 Mar  4 23:13 sdd.img
-rw-r--r-- 1 midnite midnite        51200 Mar  5 00:23 sdd.img.tar

(With reference to this post, nullifying the mtime makes the tar archives identical. I could make identical archives from other identical sparse or non-sparse files, but this behaviour is apparently not guaranteed.)

(According to this post, if I could extract the content of a sparse file in less than 10 minutes, it would be faster to verify that the files are identical. But I do not know Python. It would be nice if some native Linux program could do it.)

PS - I would prefer diff to cmp, because diff can also compare directories recursively.

midnite

1 Answer

I think this tool I made might be useful to you: https://github.com/ArthurMLago/sparsediff

I had to compare huge sparse files too: 60G apparent size, with a couple hundred KB of real disk usage. I never found a good solution online, so I ended up writing my own application. It uses lseek with SEEK_HOLE and SEEK_DATA to efficiently find the relevant sections of the first file and compare them with the second. The output was inspired by hexdump -C and is intended for binary files.
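
The general idea of skipping holes with SEEK_DATA and SEEK_HOLE can be sketched in a few lines of Python. This is only an illustration under stated assumptions (Linux, regular files that are not modified during the run), not the code of the tool linked above; the names data_segments, ranges_equal and sparse_equal are hypothetical. Because two files with equal content may have different allocation maps, the sketch walks the data ranges of both files and reads each range from both sides; holes read back as zeros, so a hole on one side still matches an allocated block of zeros on the other.

```python
import errno
import os


def data_segments(fd, size):
    """Yield (start, end) byte ranges that the filesystem reports as data."""
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                return  # nothing but a hole until end of file
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end


def ranges_equal(fd_a, fd_b, start, end, chunk=1 << 20):
    """Compare bytes [start, end) of both files, one chunk at a time."""
    pos = start
    while pos < end:
        n = min(chunk, end - pos)
        if os.pread(fd_a, n, pos) != os.pread(fd_b, n, pos):
            return False
        pos += n
    return True


def sparse_equal(path_a, path_b):
    """Return True if both files have the same size and the same content."""
    fd_a = os.open(path_a, os.O_RDONLY)
    fd_b = os.open(path_b, os.O_RDONLY)
    try:
        size = os.fstat(fd_a).st_size
        if size != os.fstat(fd_b).st_size:
            return False
        # Any byte outside the data ranges of both files is a hole in both,
        # so checking every data range of either file covers all content.
        for fd in (fd_a, fd_b):
            for start, end in data_segments(fd, size):
                if not ranges_equal(fd_a, fd_b, start, end):
                    return False
        return True
    finally:
        os.close(fd_a)
        os.close(fd_b)
```

Calling sparse_equal("sparse_file_one/sdd.img", "sparse_file_two/sdd.img") would read only the ranges reported as data, so the work should scale with the allocated size rather than the apparent size. Overlapping ranges get read twice, which keeps the sketch short at little cost when the real data is small.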