
Let's say we have a file /a_long_path_1/foo.doc of size, say, 12345 bytes, and we would like to find all copies of this file in the directories /a_long_path_2 and /a_long_path_3, including all their subdirectories recursively. The base names of the copies may differ from foo (though the extension .doc is likely to stay the same), and the creation/modification dates may differ, but the contents of the duplicates must be identical to foo.

If I issue find /a_long_path_2 /a_long_path_3 -size 12345c -iname \*.doc, the list I get is too large to check manually via diff. Automation is needed. Additional info that might make automation hard: some directory names in the output of this find … command contain spaces.

To be clear: I do NOT wish to find all duplicates of all files on the file system (only the duplicates of this one particular file), not even as an intermediate step. (Such a list would be huge anyway.)

1 Answer


If I issue find /a_long_path_2 /a_long_path_3 -size 12345c -iname \*.doc, the list I get is too large to check manually via diff. Automation is needed.

Add -exec cmp -s /a_long_path_1/foo.doc {} \; -print:

find /a_long_path_2 /a_long_path_3 \
   -type f \
   -size 12345c \
   -iname \*.doc \
   -exec cmp -s /a_long_path_1/foo.doc {} \; \
   -print

This works because in find, -exec is also a test: it succeeds iff the invoked tool returns exit status 0. cmp -s is silent and returns exit status 0 iff the two given files are identical. Since consecutive find expressions are ANDed by default, -print runs only for the files for which cmp succeeded. Note also that -exec passes each pathname to cmp as a single argument, so directory names containing spaces are handled safely.
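
To see the same logic spelled out, here is a rough bash equivalent of what the find command above does (a sketch only; the candidate variable name is just for illustration, and -print0 paired with read -d '' keeps pathnames with spaces intact):

find /a_long_path_2 /a_long_path_3 -type f -size 12345c -print0 |
while IFS= read -r -d '' candidate; do
    # cmp -s is silent; exit status 0 means the contents are identical
    if cmp -s /a_long_path_1/foo.doc "$candidate"; then
        printf '%s\n' "$candidate"
    fi
done

Letting find run cmp via -exec, as in the answer, achieves the same result without an explicit loop.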

-iname \*.doc can speed things up, but in general it may cause you to miss duplicates whose names do not end in .doc. -type f and -size 12345c, on the other hand, are safe and cheap preliminary tests.
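
If you would rather not hard-code the 12345c, a possible variant (again only a sketch; ref is a helper variable introduced here, and stat -c %s assumes GNU coreutils, whereas BSD/macOS stat uses -f %z instead) reads the size from the reference file itself:

ref=/a_long_path_1/foo.doc
find /a_long_path_2 /a_long_path_3 \
   -type f \
   -size "$(stat -c %s "$ref")"c \
   -exec cmp -s "$ref" {} \; \
   -print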