I have two folders/directories, C:\MyData and C:\MyDataBackup. The person who owns them does not remember whether they edited the files in the original or in the backup.

I want to get rid of C:\MyDataBackup, so I have to find all the files in there that are identical to their siblings in C:\MyData and delete them, and then have the owner handle the handful of remaining files manually.

How can I achieve that? The duplicate detection tools I have used so far usually have the shortcoming of...

  • ...searching for duplicates within C:\MyData and within C:\MyDataBackup as well. That is not allowed! Those duplicates are intentional and must not be deleted. And since the data piles are huge, it would stretch the search to weeks.
  • ...not doing complete byte-by-byte comparisons but just relying on hash sums.
  • ...not sticking to the same path. E.g. they mark C:\MyData\task1\done.txt as identical to both C:\MyDataBackup\task1\done.txt and C:\MyDataBackup\task57\done.txt.

So, how can I do a duplicates search

  • in two folders/directories, finding only pairs over both and not within each
  • with complete comparison (byte-by-byte)
  • with restriction to the same path inside the respective folder/directory?

I am using Windows, but have Cygwin, so I can use bash magic as well.

Kurtibert

2 Answers

Preliminary note

Test the solution on some expendable pair of directories first.
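
One way to build such an expendable pair is sketched below. All directory and file names in it are invented for testing; the three files in the backup cover the interesting cases: identical with the same path, same path but different content, and identical content under a different path.

```shell
# Throwaway sandbox; every name below is hypothetical test data.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/MyData/task1" \
         "$sandbox/MyDataBackup/task1" \
         "$sandbox/MyDataBackup/task57"

# Identical content under the same relative path: should be deleted.
printf 'same\n'     > "$sandbox/MyData/task1/done.txt"
printf 'same\n'     > "$sandbox/MyDataBackup/task1/done.txt"

# Same relative path, different content: must survive.
printf 'original\n' > "$sandbox/MyData/task1/notes.txt"
printf 'edited\n'   > "$sandbox/MyDataBackup/task1/notes.txt"

# Same content, but no sibling under that path in MyData: must survive.
printf 'same\n'     > "$sandbox/MyDataBackup/task57/done.txt"

printf 'created %s and %s\n' "$sandbox/MyData" "$sandbox/MyDataBackup"
```

Use the two printed paths in place of the real directories while trying the commands, and remove the sandbox afterwards.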


Solution

This answer uses *nix tools. It should work in Cygwin, that is, in a shell (like bash) provided by Cygwin. (The shell is important; see this question.)

To be DRY, I will use shell variables. If you ever need to apply this answer to other directories then it's enough to change the variables, while commands that follow are static. Use absolute paths. Run this snippet to set the variables:

reference='/cygdrive/c/MyData'
mutable='/cygdrive/c/MyDataBackup'

(Single-quotes are not necessary in this particular case; however users without experience who want to process directories with spaces in names will probably appreciate the quotes being already in the right places.)

You need to cd to the mutable directory. If the below command fails for any reason, abort.

cd -- "$mutable"
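
If you prefer the abort to be automatic, a small guard can wrap the cd. This is only a sketch; enter_mutable is a hypothetical helper name, not part of any tool. Besides checking that cd works, it refuses to continue when both paths resolve to the same directory, because then every file would be compared with itself and deleted.

```shell
# enter_mutable REFERENCE MUTABLE   (hypothetical helper)
# Changes into MUTABLE only if both directories exist and are not the same
# directory; otherwise prints a message and returns non-zero.
enter_mutable() {
    ref_real=$(cd -- "$1" && pwd -P) || {
        echo "cannot enter reference directory: $1" >&2
        return 1
    }
    mut_real=$(cd -- "$2" && pwd -P) || {
        echo "cannot enter mutable directory: $2" >&2
        return 1
    }
    if [ "$ref_real" = "$mut_real" ]; then
        echo "reference and mutable are the same directory, aborting" >&2
        return 1
    fi
    cd -- "$2"
}
```

Typical use would be enter_mutable "$reference" "$mutable" || exit 1, followed by the find command.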

This is a command that does the real work:

find . -type f \
       -print \
       -exec test -f "$reference"/{} \; \
       -exec test ! -L "$reference"/{} \; \
       -exec cmp -- {} "$reference"/{} \; \
       -delete

Explanation

  • . defines our starting point, the current working directory. Thanks to the prior cd this will be the mutable directory. We don't use "$mutable" as the starting point, because we need find to consider relative paths so we can concatenate them with the path to the reference directory later. Our find will try to test all files under (and including) ., descending to subdirectories of any depth.

  • -type f is a test that checks if the currently considered file is a regular file. The purpose of this test is to avoid giving files of other types to cmp later. E.g. we don't want to use cmp with directories.

  • -print prints the pathname of the currently considered file. This is only to give indication of progress; you can omit -print if you want.

  • -exec test -f "$reference"/{} \; tests if there is a regular file under the same relative path in the reference directory. In the manual of GNU find, -exec … ; is described as an action, but it is also a test: it succeeds iff the called executable (here test) returns exit status 0, and this is what we rely on here. Note that test -f follows symlinks, so a symlink pointing to a regular file passes it; this is why the next test exists.

  • -exec test ! -L "$reference"/{} \; succeeds only if the path in the reference directory is not a symlink. Together, the two tests are not only to avoid giving files of unexpected types to cmp later; they are also to:

    • avoid giving a nonexistent file to cmp;
    • avoid giving a symlink to cmp (see below).
  • -exec cmp -- {} "$reference"/{} \; is a test that actually compares the two files. Note if cmp is given a symlink and the target of the symlink then it will tell you the contents are identical. In the context of your question: if foo in the reference directory is a symlink to foo in the mutable directory then cmp will make us think there are two copies, while the only copy is in the mutable directory and if we blindly believe cmp then we will delete it. Not giving symlinks to cmp (see above) solves this problem.

  • -delete tries to delete the currently considered file. This action will be performed iff all the previous tests succeeded for the file.
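
Before deleting anything, it may help to list what would go. The sketch below wraps a read-only variant of the command in a function; list_deletable is a hypothetical name. It uses cmp -s (silent) so that only pathnames are printed, and it includes an explicit test ! -L because test -f alone follows symlinks. Like the command above, it assumes GNU find, which expands {} inside a longer argument.

```shell
# list_deletable REFERENCE MUTABLE   (hypothetical helper)
# Prints the files under MUTABLE that have a byte-identical regular-file
# sibling (not a symlink) under the same relative path in REFERENCE.
# Nothing is deleted. REFERENCE must be an absolute path.
list_deletable() {
    ( cd -- "$2" || exit 1
      find . -type f \
             -exec test -f "$1"/{} \; \
             -exec test ! -L "$1"/{} \; \
             -exec cmp -s -- {} "$1"/{} \; \
             -print )
}
```

For example, list_deletable "$reference" "$mutable" | less lets you review the candidates first.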


Portability

AFAIK find in Cygwin is GNU find. It supports -delete, which is a non-portable extension. GNU find also supports expanding more than one {} in -exec, as well as expanding {} concatenated with some string; these features are not portable either. If you ever need a portable solution, use the snippet below. It's an alternative to the above, not an addition.

find . -type f \
       -exec sh -c '
          reference="$1"
          shift
          for f; do
             printf "%s\\n" "$f"
             # test -f follows symlinks, so reject symlinks explicitly
             test -f "$reference/$f" \
             && test ! -L "$reference/$f" \
             && cmp -- "$f" "$reference/$f" \
             && rm -- "$f"
          done
       ' find-sh "$reference" {} +

Reasonable additions

Next you probably want to delete empty directories from the mutable directory:

find . -type d -empty -delete

-empty and -delete are not portable. It's relatively easy to replace -delete (with -depth + -exec rmdir -- {} \;), not so easy to replace -empty, I won't elaborate.
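
One workaround that is sometimes used instead of -empty is to call rmdir on every directory, deepest first, and rely on rmdir refusing to remove non-empty ones. The sketch below takes that route; remove_empty_dirs is a hypothetical name, and silencing stderr also hides genuine errors such as permission problems, so treat it as an illustration rather than a drop-in replacement.

```shell
# remove_empty_dirs DIR   (hypothetical helper)
# Removes empty directories below DIR, deepest first. rmdir fails on
# non-empty directories (and on "." itself); those failures are ignored.
remove_empty_dirs() {
    ( cd -- "$1" || exit 1
      find . -depth -type d -exec rmdir -- {} \; 2>/dev/null
      true )
}
```

Called as remove_empty_dirs "$mutable", it leaves directories that still contain files untouched.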

Maybe you also want to delete symlinks and such. The following command tries to delete files, excluding directories and regular files:

find . ! -type d ! -type f -delete

Now the mutable directory (i.e. our current working directory) contains only a minimal directory tree with regular files that are candidates for manual inspection.
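
To give the owner a starting point for the manual inspection, the leftovers can be classified. The following is a sketch with a hypothetical function name, report_leftovers; it only reads files and uses the portable sh -c form. Note that its test -f follows symlinks, so it is meant as a report, not as a basis for deleting.

```shell
# report_leftovers REFERENCE MUTABLE   (hypothetical helper)
# For each regular file under MUTABLE, prints whether it has no sibling in
# REFERENCE, an identical sibling, or a differing sibling. Read-only.
report_leftovers() {
    ( cd -- "$2" || exit 1
      find . -type f -exec sh -c '
          reference="$1"
          shift
          for f; do
              if [ ! -f "$reference/$f" ]; then
                  printf "only in backup: %s\n" "$f"
              elif cmp -s -- "$f" "$reference/$f"; then
                  printf "identical: %s\n" "$f"
              else
                  printf "differs: %s\n" "$f"
              fi
          done
      ' find-sh "$1" {} + )
}
```

Files reported as "differs" are the ones where the owner has to decide which version to keep; "only in backup" files have no counterpart in the reference directory at all.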


Notes

  • In general there are race conditions (TOCTOU) that may allow a rogue user to make you delete a file in a wrong directory. E.g. see Race Conditions with -exec.

  • In many places I used --. If the paths in the variables are absolute and the starting point for find is . then -- is not really needed. I decided to use -- in case someone uses this answer as an inspiration and writes code where -- may actually be useful.

  • find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?

The answer from Kamil Maciorowski is excellent.

Inspired by it, I have written a script around the 'find' command that adds a little more convenience and error checking:

https://github.com/rdiez/Tools/tree/master/DeleteFilesIfDuplicatedInReferenceDir

rdiez