I have a folder full of files, with a number of duplicate files in it. Unfortunately, in a number of cases, one version is an updated version of another, so a straight byte-match doesn't locate the duplication. (I've looked at this question, but all the ones I've looked at from the list seem to do only byte-count comparison...)

Are there any (Windows) dedup applications that can do a similarity-match and point the user to the files in question for examination? Freeware is good, free trial is acceptable. Even just a list of similarities to tell me where to look would probably work.

EDIT: Sorry, I should have mentioned: these are text-based files, primarily DOC, PPT and PDF. The most likely thing to have changed is the content, but formatting might differ as well. Even just picking up on text changes would probably be helpful, though...

Margaret

3 Answers

3

You could try a plagiarism detector. Plagiarism and updates don't present exactly the same kind of similarities, so it may or may not give useful results, but there are a lot to choose from, so if one doesn't help, another might. I don't have a particular program to recommend; you could ask any teacher or professor you know (preferably outside computer science, since computer science instructors are more likely to be familiar with programming plagiarism than with natural-language plagiarism).

1

Look for ssdeep and sdhash.

I've never tried sdhash, but I've read that it's better than ssdeep. Either way, both provide a CLI that computes fuzzy hashes and compares them for similarity.

Should work fairly well for your goal.
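
To illustrate the general idea of scoring similarity instead of testing for exact equality, here is a minimal Python sketch. It does not use ssdeep or sdhash: it swaps in the standard library's difflib.SequenceMatcher as a stand-in, and it assumes the text has already been extracted from the DOC/PPT/PDF files into plain strings. The function name similar_pairs and the 0.8 threshold are hypothetical choices for illustration, not anything taken from those tools.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(texts, threshold=0.8):
    """Return (name_a, name_b, score) for every pair scoring >= threshold.

    `texts` maps a file name to its already-extracted plain text
    (text extraction from DOC/PPT/PDF is assumed to happen elsewhere).
    """
    pairs = []
    # Compare every file against every other file exactly once.
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        # ratio() is a similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, text_a, text_b).ratio()
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 3)))
    # Most similar pairs first, so the likeliest duplicates are reviewed first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    sample = {
        "report_v1.txt": "Quarterly sales rose in the north region.",
        "report_v2.txt": "Quarterly sales rose sharply in the north region.",
        "notes.txt": "Remember to book the meeting room for Friday.",
    }
    for name_a, name_b, score in similar_pairs(sample):
        print(f"{name_a} ~ {name_b}: {score}")
```

Comparing every pair this way gets slow on a large folder, which is one reason dedicated fuzzy-hash tools may be the better fit: they reduce each file to a short signature first and compare the signatures.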

PS: Sorry for the brevity and the lack of links, but I'm on mobile at the moment.

Alix Axel
0

I don't know of any applications, but if most of the content is the same between versions, you could do a Windows Search on the directory with the "word or phrase in the file" option. Your query would be a particular phrase that doesn't change much between versions (or at least that you don't think changes much) and that is distinctive to that particular document or set of documents. This type of search should work for PDF, DOC, and PPT even though they are not straight text files. It won't give you the exact output you're looking for, but if you choose your search phrase well and the content doesn't vary wildly between versions, it should work pretty well.
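
The same phrase-based approach can be sketched in Python for cases where a script is easier to work with than the search dialog. This is only an illustration under assumptions: it expects text that has already been extracted into plain strings, and the function name files_containing is hypothetical. It ignores case and differences in whitespace so that reflowed paragraphs still match.

```python
def files_containing(phrase, texts):
    """Return the sorted names of files whose text contains `phrase`.

    `texts` maps a file name to its already-extracted plain text.
    Matching ignores case and collapses runs of whitespace.
    """
    # Normalize the phrase: case-insensitive, single spaces between words.
    needle = " ".join(phrase.casefold().split())
    hits = []
    for name, text in texts.items():
        # Apply the same normalization to each document's text.
        haystack = " ".join(text.casefold().split())
        if needle in haystack:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    sample = {
        "proposal_old.txt": "The Riverside  Renewal Project begins in spring.",
        "proposal_new.txt": "the riverside renewal project\nbegins in early summer.",
        "minutes.txt": "Agenda items were approved without changes.",
    }
    print(files_containing("Riverside Renewal Project", sample))
```

Files returned for the same phrase are candidates for being versions of one document and still need to be checked by eye.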

Littleman