
I need to compare two directories on Linux filesystems on two separate servers in order to identify whether all the files from SERVER1 are present on SERVER2. The total data set is about 4TB of files.

The data has been copied across using rsync but I cannot take the chance that anything is missing as the source data is going to be purged once the migration is complete.

I have tried a number of approaches to compare the data (diff of the recursive directory listing, rsync in dry-run mode) but can't find anything that gives me a manageable output or doesn't take forever to run.

Interested to hear different approaches as so far I don't have one I'm happy with.

2 Answers


nohup rsync -r --checksum [--delete] --itemize-changes --dry-run source/ target/ > output_file 2>&1 &

--checksum: rsync 3.0 and later use MD5 for the per-file checksum, and the chance of an accidental collision is negligible (MD5 is a 128-bit hash, so roughly 1 in 2^128, about 3.4 x 10^38, for any given pair of files). As pointed out, you have to accept that this will take time, but it will do the work for you.

--itemize-changes (and/or --verbose) will show the actual differences found.

Without --dry-run, rsync copies files whose sizes differ outright; if the sizes are identical, it checksums them and copies only when the checksums differ. No other check is done. The --delete option removes files from the target that are not present on the source. Errors also go to output_file.

If in doubt, run a test between two smaller directories first.
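As a sketch of the remote form (assuming SSH access from SERVER1 to SERVER2, and with /data and the user name standing in for your actual path and account), the same comparison run from SERVER1 would look like:

# dry-run comparison from SERVER1 against SERVER2 over SSH; nothing is copied or deleted
nohup rsync -r --checksum --itemize-changes --dry-run /data/ user@SERVER2:/data/ > output_file 2>&1 &

In the itemized output, entries whose change string ends in +++++++++ are files or directories that do not exist on SERVER2 at all, which is exactly what you are checking for.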

chipfall

You can run a command like this on both servers:

find /starting/path -type f -exec sha1sum {} \; >>/tmp/sums

and then compare the /tmp/sums files from the two servers. This calculates a hash for every file, so if the lists match you can be confident (with very high probability) that the data is identical. You can also use sha256sum or sha512sum to reduce the probability of a collision (two different files producing the same hash) even further.
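As a sketch of the comparison step (assuming the same /starting/path exists on both servers, and that you copy SERVER1's list over to SERVER2 as /tmp/sums.server1, a name chosen here for illustration), you can either diff the sorted lists or let sha1sum verify SERVER1's hashes directly against the files on SERVER2:

# on SERVER2, after copying SERVER1's list over as /tmp/sums.server1
sort /tmp/sums.server1 > /tmp/sums.server1.sorted
sort /tmp/sums > /tmp/sums.server2.sorted
diff /tmp/sums.server1.sorted /tmp/sums.server2.sorted   # any output means a missing or differing file

# alternatively, verify SERVER1's hashes against SERVER2's copies of the files
sha1sum -c --quiet /tmp/sums.server1

find does not emit files in any guaranteed order, which is why the lists are sorted before diffing; sha1sum -c sidesteps that by checking each listed path directly and reporting any file that is missing or whose hash does not match.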

Romeo Ninov