11

I simply want to back up and archive the files on several machines. Unfortunately, some of them are large files that are really the same file, just stored differently on different machines. For instance, there may be a few hundred photos that were copied from one computer to another as an ad-hoc backup. Now that I want to make a common repository of files, I don't want several copies of the same photo.

If I copy all of these files to a single directory, is there a tool that can go through them, recognize duplicate files, and give me a list, or even delete one of the duplicates?

User1

7 Answers

4

Create an md5sum of each file; duplicate md5sums suggest (but don't guarantee) duplicate files.
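
If you'd rather script that idea than install a dedicated tool, here is a minimal Python sketch (my own illustration, not any particular utility): it walks a directory, groups files by MD5, and prints any checksum that occurs more than once.

```python
#!/usr/bin/env python3
"""List groups of files whose MD5 checksums match."""
import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large photos don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    """Map each MD5 digest to the list of files that produced it."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_hash[md5_of(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in sorted(find_duplicates(sys.argv[1]).items()):
        print(digest)
        for p in paths:
            print("   ", p)
```

Before deleting anything, do a byte-for-byte comparison within each group (e.g. with `filecmp.cmp(a, b, shallow=False)`), since matching checksums only suggest identical content.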

bryan
3

You could use dupemerge to turn the identical files into hardlinks. It'll take a very long time on a large file set though. SHA (or MD5) hashes of the files will almost certainly work faster, but you'll have to do more legwork in finding the duplicates. The probability of accidental collision is so low that in reality you can ignore it. (In fact, many deduplication products already do this.)
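
As a rough sketch of that hash-then-hardlink approach (my own illustration, not dupemerge itself; it assumes everything lives on one filesystem, since hard links cannot cross filesystems, and it verifies contents byte-for-byte before linking):

```python
#!/usr/bin/env python3
"""Replace byte-identical files under a directory with hard links to one copy."""
import filecmp
import hashlib
import os
import sys

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks to keep memory use small."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hardlink_duplicates(root):
    first_seen = {}  # digest -> first path seen with that content
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            original = first_seen.setdefault(sha256_of(path), path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy of this content, or already linked to it
            # Hashes match; confirm the contents really are identical
            # before replacing the duplicate with a hard link.
            if filecmp.cmp(original, path, shallow=False):
                os.unlink(path)
                os.link(original, path)
                print(f"linked {path} -> {original}")

if __name__ == "__main__":
    hardlink_duplicates(sys.argv[1])
```

Bear in mind that once files are hard-linked, editing one "copy" edits them all.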

Your best bet for dealing with photos and music is to get tools tailored to finding duplicates of those items in particular, especially since the files may not be identical at the binary level once tagging, cropping, or encoding differences come into play. You'll want tools that can find photos that "look" the same and music that "sounds" the same even if minor adjustments have been made to the files.
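
For the photo case, perceptual hashing is the usual trick. Here is a small sketch using the third-party Pillow and ImageHash packages (an assumption on my part; both install via pip): images whose perceptual hashes differ by only a few bits usually look the same, even after re-encoding or minor edits.

```python
#!/usr/bin/env python3
"""Report pairs of images that look alike, using perceptual hashes."""
import sys
from pathlib import Path

import imagehash       # pip install ImageHash
from PIL import Image  # pip install Pillow

THRESHOLD = 5  # max differing bits before we stop calling two images "the same"

def report_similar_images(root):
    seen = []  # list of (perceptual hash, path) for images processed so far
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".gif"}:
            continue
        with Image.open(path) as img:
            h = imagehash.phash(img)
        for other_hash, other_path in seen:
            if h - other_hash <= THRESHOLD:  # Hamming distance between hashes
                print(f"{path} looks like {other_path}")
        seen.append((h, path))

if __name__ == "__main__":
    report_similar_images(sys.argv[1])
```

This compares every image against every other one, so it is fine for a few hundred photos but will not scale to huge collections without something cleverer.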

afrazier
1

Well, if you have the ability, you can set up a deduplicating filesystem and put your backups on that. This will not only deduplicate whole files, but also similar pieces of files. For example, if you have the same JPEG in several places, but with different EXIF tags on each version, a deduplicating filesystem would only store the image data once.

Deduplicating filesystems include lessfs, ZFS, and SDFS.
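
As a toy illustration of why block-level deduplication can help even when whole files differ (purely conceptual, not how any of those filesystems is actually implemented): split two files into fixed-size blocks, hash the blocks, and count how many are shared. Like real block-based schemes, this only detects shared data that happens to fall on the same block boundaries.

```python
#!/usr/bin/env python3
"""Toy demo: count how many fixed-size blocks two files have in common."""
import hashlib
import sys

BLOCK_SIZE = 4096  # illustrative; real filesystems pick their own block/extent sizes

def block_hashes(path):
    """Return the SHA-256 digest of every fixed-size block in the file."""
    hashes = []
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

if __name__ == "__main__":
    a = block_hashes(sys.argv[1])
    b = block_hashes(sys.argv[2])
    shared = set(a) & set(b)
    print(f"{sys.argv[1]}: {len(a)} blocks")
    print(f"{sys.argv[2]}: {len(b)} blocks")
    print(f"distinct blocks in common: {len(shared)}")
```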

0

Hard links only perform deduplication if the entire file is identical. If headers (EXIF, ID3, …) or metadata (owner) differ, they will not be linked.

If you have the option of using a file system with block-level deduplication support (ZFS, Btrfs, …), use that instead. I am very fond of the offline (a.k.a. batch) dedup support in Btrfs, which deduplicates at the extent level and does not constantly consume huge amounts of memory (as ZFS online dedup does).

Deduplication also has the advantage that a file can be modified by the user without the other copy noticing (unlike with hard links); that might not be applicable in your case, but it is in others.

See https://btrfs.wiki.kernel.org/index.php/Deduplication for an excellent discussion.

0

When I was doing this kind of thing, I found that it's a lot more engaging (and time-efficient) to just go through the files yourself in your free time, over the course of a couple of weeks. You can tell the difference between things far better than your computer can.

If you don't agree, then I suggest EasyDuplicateFinder. As I mentioned above, though, it'll take a long time: say, about a day for 5 GB of files.

And on another note, CrashPlan does what you were doing before, but in a much more organized way that avoids the versioning problems.

digitxp
0

Another possibility, presuming the machines you're backing up support it, is to use something like rsync.

If you rsync from A to B, then from C to B, then from D to B, etc., exact duplicates (i.e., files with the same name and path) will be eliminated, and the files will be synchronized between the machines you're backing up.

If you don't want them all synchronized to each other, however, this is not the best way to go.

warren
0

For image files, use findimagedupes. It's also packaged in Debian.

cweiske