
I have hundreds of thousands of files spread across many external disks and disks in computers, and many of them are duplicates. I created this mess myself by making copies for safety purposes. From time to time I changed the directory structure of my organization, but did not replicate those changes in the other places where I had copies.

Now I have a single huge disk with almost everything I really need, backed up and mirrored in the cloud.

I would like a way to delete everything from all those scattered disks that is already on the big disk.

Let me show the scenario:

OldDisk1:

/code/{manystructures}/{manyfiles}
/docs/{manystructures}/{manyfiles}

OldDisk2:

/dev/{another_structures}/{same_files_different_names}
/documents/{another_structures}/{same_files_different_names}

NewHugeDisk:

/home/username/code/{new_structure}/{new_files}
/home/username/documents/{new_structure}/{new_files}

Does anyone know a tool or a way to do something like "find all files on OldDisk1 that are already on NewHugeDisk and delete them"?

I have looked at many tools (for Windows, Mac and Linux, as I have this issue on all of them), both free and paid, but with no luck.

One idea would be to write some code to do that, but I'm not a developer. I can write small and simple scripts, but this kind of code would, I think, be too complicated for me.

I would appreciate any help or ideas on this.


3 Answers


Assuming you can use Windows as an OS for the whole process and you don't like Free Duplicate File Finder (never tried it, but found it mentioned here), you could use PowerShell to achieve what you want with relatively little effort. Note: I'm not a real pro at PowerShell, so I'm pretty sure that one could refine my code.

Just open PowerShell ISE (or, if you don't have it, use Notepad), copy and paste the following code into it, and save the resulting file somewhere as *.ps1. You also have to change the values of $oldpath and $newpath to your directories - just put your paths between the quotes.

# Search-and-Destroy-script
# Get all files of both code-directories:
$oldpath = "Disk1:\code"
$newpath = "DiskNew:\code"

$files_old = Get-ChildItem -Path $oldpath -Recurse -File
$files_new = Get-ChildItem -Path $newpath -Recurse -File

for($i = 0; $i -lt $files_old.Length; $i++){
    for($j = 0; $j -lt $files_new.Length; $j++){
        # if last edit time and file size are the same...
        if($files_old[$i].Length -eq $files_new[$j].Length -and $files_old[$i].LastWriteTime -eq $files_new[$j].LastWriteTime){
            # ...get file hashes for those files (SHA1 should be enough)
            $files_old_hash = (Get-FileHash -Path $files_old[$i].FullName -Algorithm SHA1).Hash
            $files_new_hash = (Get-FileHash -Path $files_new[$j].FullName -Algorithm SHA1).Hash
            # if the hashes are also the same, we have found a duplicate...
            if($files_old_hash -eq $files_new_hash){
                # ...so remove the old file (-Confirm can be removed so you don't have to approve every file).
                # If you want to check the files before deletion, you could also just rename them instead
                # (here we would add the suffix ".DUPLICATE"):
                # Rename-Item -Path $files_old[$i].FullName -NewName "$($files_old[$i].Name).DUPLICATE"
                Write-Host "DELETING`t$($files_old[$i].FullName)" -ForegroundColor Red
                Remove-Item -Path $files_old[$i].FullName -Confirm
                break   # this old file is handled, continue with the next one
            }
        }
    }
}

Then start the script (via right-click > Run with PowerShell, for example) - if that fails, make sure your ExecutionPolicy allows running scripts (https://superuser.com/a/106363/703240).
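
For example, from a PowerShell prompt (a minimal sketch - the script path below is just a placeholder for wherever you saved the file):

# allow locally created scripts to run for the current user only
Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
# then call the saved script (adjust the path)
& "C:\Scripts\Search-and-Destroy.ps1"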

I use an almost identical script to check for files that have already been copied (but possibly with changed names). This code assumes that only the names of the files differ, not their content. The last edit time usually stays the same even after copying a file to a new path - unlike the creation time. If the content is different, my solution fails badly - you could use other unique file attributes (but which?), or decide that e.g. only files that are smaller or older (considering the edit time, again) than the new files should be deleted. A sketch of that idea follows below.
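
If you wanted to try that last idea, the strict equality check inside the loop could be relaxed to something like this (just an untested sketch that reuses $files_old, $files_new, $i and $j from the script above; it renames instead of deleting as a safeguard, so you can review the candidates first):

# Sketch: treat the old file as a candidate when it is no newer and no larger
# than the new file, instead of requiring an exact size/time match
if($files_old[$i].LastWriteTime -le $files_new[$j].LastWriteTime -and
   $files_old[$i].Length -le $files_new[$j].Length){
    # rename rather than delete, so the candidates can be reviewed before removal
    Rename-Item -Path $files_old[$i].FullName -NewName "$($files_old[$i].Name).DUPLICATE"
    break
}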

What the script does:

  1. Gets all files in the specified folders (and their subfolders).
  2. Takes the first old file (indexed by $i)...
  3. ...and compares its last edit time and file size with those of the first new file (indexed by $j).
  4. If they are equal, it calculates a file hash to be sure it is definitely the same file (arguably, this could be a bit too much effort for your goal).
  5. If the hashes are equal, the old file gets deleted (and the script writes which file in the terminal), then it starts again at 2. with the next old file.
  6. If the hashes are not equal (or the last edit times or file sizes differ), it starts again at 3. with the next new file.
flolilo

rmlint is a command-line utility with options to do exactly what you want. It runs on Linux and macOS. The command you want is:

$ rmlint --progress \
    --must-match-tagged --keep-all-tagged \
    /mnt/OldDisk1 /mnt/OldDisk2 // /mnt/NewHugeDisk

This will find the duplicates you want. Instead of deleting them directly, it creates a shell script (./rmlint.sh) which you can review, optionally edit and then execute to do the desired deletion.

The '--progress' option gives you a nice progress indicator. The '//' separates 'untagged' from 'tagged' paths; paths after '//' are considered 'tagged'. '--must-match-tagged --keep-all-tagged' means: only find files in untagged paths that have a copy in a tagged path.

You can also shorten that command using the short format of the options:

rmlint -g -m -k /mnt/OldDisk1 /mnt/OldDisk2 // /mnt/NewHugeDisk

Have you tried using third-party deduplication software?
I have tried CloudBerry deduplication and it is really efficient, as:

  • it has its own dedup mechanism to eliminate duplicate data, which saves a lot of storage space;
  • such tools are also more reliable and have a dedicated resource management technique.
yass