35

I downloaded a lot of images into a directory.
The downloader renamed files that already existed.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- renamed from `b(3).jpg` manually
c.jpg
c(2).jpg
world.jpg      <-- renamed from `d.jpg` manually
d(2).jpg
d(3).jpg

How can I remove the duplicates? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

Note: the names don't matter. I just want unique files.

kev
  • 13,200

17 Answers

72

fdupes is the tool of choice. To find all duplicate files (by content, not by name) in the current directory:

fdupes -r .

To manually confirm deletion of duplicated files:

fdupes -r -d .

To automatically delete all copies but the first of each duplicated file (be warned, this actually deletes files, as requested):

fdupes -r -f . | grep -v '^$' | xargs rm -v

I'd recommend manually checking the files before deletion:

fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v
Jakob
  • 951
35

bash 4.x

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
    [[ -f "$file" ]] || continue

    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"
    fi
done

This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive searching (globstar). Remove the echo if you like the results.

gawk version

gawk '
  {
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
      cmd="echo rm " q FILENAME q
      system(cmd)
      close(cmd)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double-quotes in their name. No real way to get around that with awk. Remove the echo if you like the results.

SiegeX
  • 2,467
3

I recommend fclones.

Fclones is a modern duplicate file finder and remover written in Rust, available on most Linux distros and macOS.

Notable features:

  • supports spaces, non-ASCII and control characters in file paths
  • allows searching in multiple directory trees
  • respects .gitignore files
  • safe: allows inspecting the list of duplicates manually before performing any action on them
  • offers plenty of options for filtering / selecting files to remove or preserve
  • very fast

To search for duplicates in the current directory simply run:

fclones group . >dupes.txt

Then you can inspect the dupes.txt file to check if it found the right duplicates (you can also modify that list to your liking).

Finally remove/link/move the duplicate files with one of:

fclones remove <dupes.txt
fclones link <dupes.txt
fclones move target <dupes.txt
fclones dedupe <dupes.txt   # copy-on-write deduplication on some filesystems

Example:

pkolaczk@p5520:~/Temp$ mkdir files
pkolaczk@p5520:~/Temp$ echo foo >files/foo1.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo2.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones group files >dupes.txt
[2022-05-13 18:48:25.608] fclones: info: Started grouping
[2022-05-13 18:48:25.613] fclones: info: Scanned 4 file entries
[2022-05-13 18:48:25.613] fclones: info: Found 3 (12 B) files matching selection criteria
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by size
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by paths and file identifiers
[2022-05-13 18:48:25.619] fclones: info: Found 2 (8 B) candidates after grouping by prefix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) candidates after grouping by suffix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) redundant files

pkolaczk@p5520:~/Temp$ cat dupes.txt
Report by fclones 0.24.0
Timestamp: 2022-05-13 18:48:25.621 +0200
Command: fclones group files
Base dir: /home/pkolaczk/Temp
Total: 12 B (12 B) in 3 files in 1 groups
Redundant: 8 B (8 B) in 2 files
Missing: 0 B (0 B) in 0 files
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /home/pkolaczk/Temp/files/foo1.txt
    /home/pkolaczk/Temp/files/foo2.txt
    /home/pkolaczk/Temp/files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones remove <dupes.txt
[2022-05-13 18:48:41.002] fclones: info: Started deduplicating
[2022-05-13 18:48:41.003] fclones: info: Processed 2 files and reclaimed 8 B space

pkolaczk@p5520:~/Temp$ ls files
foo1.txt

2

Here are some one-liners (based on the answer from Prashant Lakhera).

Preview:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} echo "{}"

Remove:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

And here is a slightly more complex version that tries to preserve files that reside deeper in the directory tree and have longer file names. Presumably those files have been sorted manually.

find . -type f | xargs -I {} md5sum "{}" | awk '{print gsub("/","/",$0), length, $0}' | sort -k3,3 -k2,2n -k1,1n | cut -d" " -f3- | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

Drawback: if you have more than 2 copies of a file, you have to run the command multiple times; a loop that automates this is sketched below.
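
One option (my own sketch reusing the exact Remove pipeline above, not part of the original answer) is to repeat the removal until a pass finds nothing left to delete:

# repeat the Remove pipeline until a pass finds no more duplicates
# (note: this loops forever if a listed file cannot actually be removed)
while dupes=$(find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2-); [ -n "$dupes" ]; do
    printf '%s\n' "$dupes" | xargs -I {} rm -f "{}"
done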

2

You can try FSLint. It has both a command-line and a GUI interface.
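
If I recall correctly, the command-line duplicate finder ships as findup under FSLint's own directory rather than on $PATH (the exact path may vary by distro), so the invocation looks roughly like this:

/usr/share/fslint/fslint/findup /path/to/images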

Bibhas
  • 2,596
1

A more concise version for removing duplicated files (just one line):

young@ubuntu-16:~/test$ md5sum `find ./ -type f` | sort -k1 | uniq -w32 -d | xargs rm -fv

find_same_size.sh

#!/usr/bin/env bash
#set -x
# This small script finds files of the same size.
find_same_size(){

    if [[ -z $1 || ! -d $1 ]]; then
        echo "Usage: $0 directory_name"
        exit 1
    else
        dir_name=$1
        echo "current directory is $1"

        for i in $(find "$dir_name" -type f); do
            ls -fl "$i"
        done | awk '{f=""
                if(NF>9)for(i=9;i<=NF;i++)f=f?f" "$i:$i; else f=$9;
                if(a[$5]){ a[$5]=a[$5]"\n"f; b[$5]++;} else a[$5]=f} END{for(x in b)print a[x] }' | xargs stat -c "%s  %n"  # just list the files
    fi
}

find_same_size "$1"


young@ubuntu-16:~/test$ bash find_same_size.sh tttt/ | awk '{ if($1 !~ /^([[:alpha:]])+/) print $2}' | xargs md5sum | uniq -w32 -d | xargs rm -vf
1

How do we test whether two files have the same content?

if diff "$file1" "$file2" > /dev/null; then
    ...

How can we get the list of files in a directory?

files="$( find ${files_dir} -type f )"

We can take any 2 files from that list and check whether their names differ and their content is the same.

#!/bin/bash
# removeDuplicates.sh

files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi

files="$( find ${files_dir} -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done

For example, suppose we have a directory:

$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)

So there are only 3 unique files.

Let's run the script:

$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'

And we are left with only 3 files.

$> ls .tmp/ -1
all.txt
file
text(2)
1

I wrote this tiny script to delete duplicated files:

https://gist.github.com/crodas/d16a16c2474602ad725b

Basically, it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.

The script won't delete anything but will print the commands to delete files.

mfilter.sh ./dir | bash
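
The gist itself isn't reproduced here. Purely as an illustration of the idea described above (a hash list in /tmp/list.txt, then pipes that print rm commands), and not the actual script, it could look something like:

# illustrative sketch only, not the contents of the gist above
find "${1:-.}" -type f -exec md5sum {} + > /tmp/list.txt
sort /tmp/list.txt | awk 'seen[$1]++ { sub(/^[^ ]+ +/, ""); printf "rm -- \"%s\"\n", $0 }'
# (file names containing double quotes or newlines would need extra care)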

Hope it helps

crodas
  • 111
0

I found an easier way to perform the same task:

for i in `md5sum * | sort -k1 | uniq -w32 -d | awk '{print $2}'`; do
    rm -rf $i
done
0

Most, and possibly all, of the remaining answers are terribly inefficient: they compute the checksum of each and every file in the directory to process.

A potentially orders-of-magnitude faster approach is to first get the size of each file, which is almost immediate (ls or stat), and then compute and compare checksums only for files whose size is not unique, keeping just one instance of the files that share both their size and their checksum.
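
A minimal sketch of that idea (mine, not from this answer; it assumes bash 4+ with GNU stat and md5sum, and file names without embedded newlines) that only prints rm commands instead of deleting anything:

#!/bin/bash
declare -A size_count first_seen

# pass 1: count how many files share each size (cheap, no hashing)
while IFS= read -r -d '' f; do
    (( size_count[$(stat -c %s "$f")]++ ))
done < <(find . -type f -print0)

# pass 2: checksum only the files whose size is not unique
while IFS= read -r -d '' f; do
    (( size_count[$(stat -c %s "$f")] > 1 )) || continue
    read -r sum _ < <(md5sum "$f")
    if [[ -n ${first_seen[$sum]} ]]; then
        echo "rm -- \"$f\"    # duplicate of ${first_seen[$sum]}"
    else
        first_seen[$sum]=$f
    fi
done < <(find . -type f -print0)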

Note that even though hash collisions can theoretically occur, there are not enough JPEG files on the entire Internet for a collision to have a realistic chance of happening. Two files sharing both their size and their checksum are identical for all intents and purposes.
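
For a rough sense of scale (my back-of-the-envelope figure, not from the linked answer): MD5 is 128 bits, so by the birthday bound the probability of any accidental collision among n files is about n^2 / 2^129; even with a billion images that is roughly 10^18 / 6.8·10^38, i.e. on the order of 10^-21.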

See: How reliable are SHA1 sum and MD5 sums on very large files?

jlliagre
  • 14,369
0

This is not what you are asking for, but someone might find it useful when the checksums differ yet the names are similar (with a suffix in parentheses). This script removes files with a "(digit)" suffix.

#! /bin/bash
# Warning: globstar excludes hidden directories.
# Turn on recursive globbing (in this script) or exit if the option is not supported:
shopt -s globstar || exit
for f in **
do
    extension="${f##*.}"
    # get only files with a parentheses suffix
    FILEWITHPAR=$( echo "${f%.*}".$extension | grep -o -P "(.*\([0-9]\)\..*)")
    # print the file to be possibly deleted
    if [ -z "$FILEWITHPAR" ]; then
        :
    else
        echo "$FILEWITHPAR ident"
        # check whether a similar file without the suffix exists
        FILENOPAR=$(echo "$FILEWITHPAR" | sed -e 's/^\(.*\)([0-9])\(.*\).*/\1\2/')
        echo "$FILENOPAR exists?"
        if [ -f "$FILENOPAR" ]; then
            # delete the file with the suffix in parentheses
            echo "$FILEWITHPAR to be deleted"
            rm -Rf "$FILEWITHPAR"
        else
            echo "no"
        fi
    fi
done
Ferroao
  • 220
0

There is a beautiful solution on https://stackoverflow.com/questions/57736996/how-to-remove-duplicate-files-in-linux/57737192#57737192:

md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs rm

Another very clear and nice solution is mentioned on https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash:

md5sum * | sort -k1 | uniq -w 32 -d
0

Here is an alternate version that runs on a Mac; this example filters the set to *.png files.

Preview

md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}'

Delete

 md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}' | xargs -I X rm -f "X"

Examples of creating aliases for csh/tcsh

alias lsdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'"
alias rmdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'| xargs -I X rm -fv X"
0

Try cldup: https://github.com/jkzhang2019/cldup

I have almost 1 TB of images collected over ten years. I created this project to remove duplicate files, and it works well.

With cldup you can create a database of all your files and identify any duplicate file with a single command.

The database can be maintained incrementally, so once it has been created, you can update it in minutes.

0

Deduplicator - Find, Sort, Filter & Delete duplicate files

Examples:

# Scan for duplicates recursively from the current dir, only look for png, jpg & pdf file types & interactively delete files
deduplicator -t pdf,jpg,png -i

# Scan for duplicates recursively from the ~/Pictures dir, only look for png, jpeg, jpg & pdf file types & interactively delete files
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i

# Scan for duplicates in ~/Pictures without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0

# Look for duplicates in the ~/.config directory while also recursing into symbolic link paths
deduplicator ~/.config --follow-links

# Scan for duplicates that are greater than 100mb in the ~/Media directory
deduplicator ~/Media --min-size 100mb

kev
  • 13,200
0

This one-liner will show each set with a blank line in between. You can always start a set by answering "n" (no) and keep answering "y" (yes) for the other files in the set:

xargs -r -d '\n' -a <(fdupes -r . | grep -vE "/.git(hub|lab)?/" | uniq) bash -c 'for file in "$@"; do if [ -z "$file" ]; then echo; else rm -vfi -- "$file"; fi; done' _

The grep part filters out git repositories and github/gitlab configs, a handy improvement, since we don't really want to remove things like HEAD inside the .git subdirectory and similar files.

Using uniq instead of grep -v '^$' removes only the duplicated lines, keeping one single blank line separating the sets.

DrBeco
  • 2,125
-1

I found a small program that really simplifies this kind of task: fdupes.