35

I downloaded a lot of images into a directory.
The downloader renamed files that already existed.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- renamed from `b(3).jpg` manually
c.jpg
c(2).jpg
world.jpg      <-- renamed from `d.jpg` manually
d(2).jpg
d(3).jpg

How can I remove the duplicates? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

Note: the names don't matter. I just want unique files.

kev
  • 13,200

17 Answers

72

fdupes is the tool of choice. To find all duplicate files (by content, not by name) in the current directory:

fdupes -r .

To manually confirm deletion of duplicated files:

fdupes -r -d .

To automatically delete all copies but the first of each duplicated file (be warned, this actually deletes files, as requested):

fdupes -r -f . | grep -v '^$' | xargs rm -v

I'd recommend manually checking the files before deletion:

fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v
Jakob
  • 951
35

bash 4.x

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
    [[ -f "$file" ]] || continue

    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"
    fi
done

This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive searching (globstar). Remove the echo if you like the results.

gawk version

gawk '
  {
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
      cmd="echo rm " q FILENAME q
      system(cmd)
      close(cmd)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double-quotes in their name. No real way to get around that with awk. Remove the echo if you like the results.

SiegeX
  • 2,467
3

I recommend fclones.

Fclones is a modern duplicate file finder and remover written in Rust, available on most Linux distros and macOS.

Notable features:

  • supports spaces, non-ASCII and control characters in file paths
  • allows searching in multiple directory trees
  • respects .gitignore files
  • safe: allows inspecting the list of duplicates manually before performing any action on them
  • offers plenty of options for filtering / selecting files to remove or preserve
  • very fast

To search for duplicates in the current directory simply run:

fclones group . >dupes.txt

Then you can inspect the dupes.txt file to check if it found the right duplicates (you can also modify that list to your liking).

Finally remove/link/move the duplicate files with one of:

fclones remove <dupes.txt
fclones link <dupes.txt
fclones move target <dupes.txt
fclones dedupe <dupes.txt   # copy-on-write deduplication on some filesystems

Example:

pkolaczk@p5520:~/Temp$ mkdir files
pkolaczk@p5520:~/Temp$ echo foo >files/foo1.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo2.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones group files >dupes.txt
[2022-05-13 18:48:25.608] fclones: info: Started grouping
[2022-05-13 18:48:25.613] fclones: info: Scanned 4 file entries
[2022-05-13 18:48:25.613] fclones: info: Found 3 (12 B) files matching selection criteria
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by size
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by paths and file identifiers
[2022-05-13 18:48:25.619] fclones: info: Found 2 (8 B) candidates after grouping by prefix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) candidates after grouping by suffix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) redundant files

pkolaczk@p5520:~/Temp$ cat dupes.txt
Report by fclones 0.24.0
Timestamp: 2022-05-13 18:48:25.621 +0200
Command: fclones group files
Base dir: /home/pkolaczk/Temp
Total: 12 B (12 B) in 3 files in 1 groups
Redundant: 8 B (8 B) in 2 files
Missing: 0 B (0 B) in 0 files
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /home/pkolaczk/Temp/files/foo1.txt
    /home/pkolaczk/Temp/files/foo2.txt
    /home/pkolaczk/Temp/files/foo3.txt

pkolaczk@p5520:~/Temp$ fclones remove <dupes.txt
[2022-05-13 18:48:41.002] fclones: info: Started deduplicating
[2022-05-13 18:48:41.003] fclones: info: Processed 2 files and reclaimed 8 B space

pkolaczk@p5520:~/Temp$ ls files
foo1.txt

2

Here are some one-liners (based on the answer from Prashant Lakhera).

Preview:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} echo "{}"

Remove:

find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

And here is a slightly more complex version that tries to preserve files that reside deeper in the directory tree and have longer file names. Presumably those files have been sorted manually.

find . -type f | xargs -I {} md5sum "{}" | awk '{print gsub("/","/",$0), length, $0}' | sort -k3,3 -k2,2n -k1,1n | cut -d" " -f3- | uniq -w32 -d | cut -d" " -f2- | xargs -I {} rm -f "{}"

Drawback: if you have more than 2 copies of a file, you have to run the command multiple times; a loop that automates this is sketched below.
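
One option (my own sketch reusing the exact Remove pipeline above, not part of the original answer) is to repeat the removal until a pass finds nothing left to delete:

# repeat the Remove pipeline until a pass finds no more duplicates
# (note: this loops forever if a listed file cannot actually be removed)
while dupes=$(find . -type f | xargs -I {} md5sum "{}" | sort -k1 | uniq -w32 -d | cut -d" " -f2-); [ -n "$dupes" ]; do
    printf '%s\n' "$dupes" | xargs -I {} rm -f "{}"
done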

2

You can try FSLint. It has both a command-line and a GUI interface.
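
If I recall correctly, the command-line duplicate finder ships as findup under FSLint's own directory rather than on $PATH (the exact path may vary by distro), so the invocation looks roughly like this:

/usr/share/fslint/fslint/findup /path/to/images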

Bibhas
  • 2,596
1

A more concise version for removing duplicated files (just one line):

young@ubuntu-16:~/test$ md5sum `find ./ -type f` | sort -k1 | uniq -w32 -d | xargs rm -fv

find_same_size.sh

#!/usr/bin/env bash
#set -x
# This small script finds files of the same size.
find_same_size(){

    if [[ -z $1 || ! -d $1 ]]; then
        echo "Usage: $0 directory_name"
        exit 1
    else
        dir_name=$1
        echo "current directory is $1"

        for i in $(find "$dir_name" -type f); do
            ls -fl "$i"
        done | awk '{f=""
                if(NF>9)for(i=9;i<=NF;i++)f=f?f" "$i:$i; else f=$9;
                if(a[$5]){ a[$5]=a[$5]"\n"f; b[$5]++;} else a[$5]=f} END{for(x in b)print a[x] }' | xargs stat -c "%s  %n"  # just list the files
    fi
}

find_same_size "$1"


young@ubuntu-16:~/test$ bash find_same_size.sh tttt/ | awk '{ if($1 !~ /^([[:alpha:]])+/) print $2}' | xargs md5sum | uniq -w32 -d | xargs rm -vf
1

How do we test whether two files have the same content?

if diff "$file1" "$file2" > /dev/null; then
    ...

How can we get the list of files in a directory?

files="$( find ${files_dir} -type f )"

We can take any 2 files from that list and check whether their names differ and their content is the same.

#!/bin/bash
# removeDuplicates.sh

files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi

files="$( find ${files_dir} -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done

For example, suppose we have a directory:

$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)

So there are only 3 unique files.

Let's run the script:

$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'

And we are left with only 3 files.

$> ls .tmp/ -1
all.txt
file
text(2)
1

I wrote this tiny script to delete duplicated files:

https://gist.github.com/crodas/d16a16c2474602ad725b

Basically, it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.

The script won't delete anything but will print the commands to delete files.

mfilter.sh ./dir | bash
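
The gist itself isn't reproduced here. Purely as an illustration of the idea described above (a hash list in /tmp/list.txt, then pipes that print rm commands), and not the actual script, it could look something like:

# illustrative sketch only, not the contents of the gist above
find "${1:-.}" -type f -exec md5sum {} + > /tmp/list.txt
sort /tmp/list.txt | awk 'seen[$1]++ { sub(/^[^ ]+ +/, ""); printf "rm -- \"%s\"\n", $0 }'
# (file names containing double quotes or newlines would need extra care)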

Hope it helps

crodas
  • 111
0

I found an easier way to perform the same task:

for i in `md5sum * | sort -k1 | uniq -w32 -d | awk '{print $2}'`; do
    rm -rf $i
done
0

Most, and possibly all, of the remaining answers are terribly inefficient: they compute the checksum of each and every file in the directory to process.

A potentially orders-of-magnitude faster approach is to first get the size of each file, which is almost immediate (ls or stat), and then compute and compare checksums only for files whose size is not unique, keeping just one instance of the files that share both their size and their checksum.
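
A minimal sketch of that idea (mine, not from this answer; it assumes bash 4+ with GNU stat and md5sum, and file names without embedded newlines) that only prints rm commands instead of deleting anything:

#!/bin/bash
declare -A size_count first_seen

# pass 1: count how many files share each size (cheap, no hashing)
while IFS= read -r -d '' f; do
    (( size_count[$(stat -c %s "$f")]++ ))
done < <(find . -type f -print0)

# pass 2: checksum only the files whose size is not unique
while IFS= read -r -d '' f; do
    (( size_count[$(stat -c %s "$f")] > 1 )) || continue
    read -r sum _ < <(md5sum "$f")
    if [[ -n ${first_seen[$sum]} ]]; then
        echo "rm -- \"$f\"    # duplicate of ${first_seen[$sum]}"
    else
        first_seen[$sum]=$f
    fi
done < <(find . -type f -print0)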

Note that even though hash collisions can theoretically occur, there are not enough JPEG files on the entire Internet for a collision to have a realistic chance of happening. Two files sharing both their size and their checksum are identical for all intents and purposes.
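
For a rough sense of scale (my back-of-the-envelope figure, not from the linked answer): MD5 is 128 bits, so by the birthday bound the probability of any accidental collision among n files is about n^2 / 2^129; even with a billion images that is roughly 10^18 / 6.8·10^38, i.e. on the order of 10^-21.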

See: How reliable are SHA1 sum and MD5 sums on very large files?

jlliagre
  • 14,369
0

This is not what you are asking for, but someone might find it useful when the checksums differ yet the names are similar (with a suffix in parentheses). This script removes files with a "(digit)" suffix.

#! /bin/bash
# Warning: globstar excludes hidden directories.
# Turn on recursive globbing (in this script) or exit if the option is not supported:
shopt -s globstar || exit
for f in **
do
    extension="${f##*.}"
    # get only files with a parentheses suffix
    FILEWITHPAR=$( echo "${f%.*}".$extension | grep -o -P "(.*\([0-9]\)\..*)")
    # print the file to be possibly deleted
    if [ -z "$FILEWITHPAR" ]; then
        :
    else
        echo "$FILEWITHPAR ident"
        # check whether a similar file without the suffix exists
        FILENOPAR=$(echo "$FILEWITHPAR" | sed -e 's/^\(.*\)([0-9])\(.*\).*/\1\2/')
        echo "$FILENOPAR exists?"
        if [ -f "$FILENOPAR" ]; then
            # delete the file with the suffix in parentheses
            echo "$FILEWITHPAR to be deleted"
            rm -Rf "$FILEWITHPAR"
        else
            echo "no"
        fi
    fi
done
Ferroao
  • 220
0

There is a beautiful solution on https://stackoverflow.com/questions/57736996/how-to-remove-duplicate-files-in-linux/57737192#57737192:

md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs rm

Another very clear and nice solution is mentioned on https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash:

md5sum * | sort -k1 | uniq -w 32 -d
0

Here is an alternate version that runs on a Mac; this example filters the set to *.png files.

Preview

md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}'

Delete

 md5sum *.png | sort | awk 'BEGIN { val=$1} { if ($1 == val) print $2; val=$1}' | xargs -I X rm -f "X"

Examples of creating aliases for csh/tcsh

alias lsdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'"
alias rmdup "md5sum \!:1 | sort | awk 'BEGIN { val="\$"1} { if ( "\$"1 == val ) print "\$"2 ; val="\$"1}'| xargs -I X rm -fv X"
0

Try cldup: https://github.com/jkzhang2019/cldup

I have almost 1 TB of images collected over ten years. I created this project to remove duplicate files, and it works well.

With cldup you can create a database of all your files and identify any duplicate file with a single command.

The database can be maintained incrementally, so once it has been created, you can update it in minutes.

0

Deduplicator - Find, Sort, Filter & Delete duplicate files

Examples:

# Scan for duplicates recursively from the current dir, only look for png, jpg & pdf file types & interactively delete files
deduplicator -t pdf,jpg,png -i

# Scan for duplicates recursively from the ~/Pictures dir, only look for png, jpeg, jpg & pdf file types & interactively delete files
deduplicator ~/Pictures/ -t png,jpeg,jpg,pdf -i

# Scan for duplicates in ~/Pictures without recursing into subdirectories
deduplicator ~/Pictures --max-depth 0

# Look for duplicates in the ~/.config directory while also recursing into symbolic link paths
deduplicator ~/.config --follow-links

# Scan for duplicates that are greater than 100mb in the ~/Media directory
deduplicator ~/Media --min-size 100mb

kev
  • 13,200
0

This one-liner will show each set with a blank line in between. You can always start a set by answering "n" (no) and keep answering "y" (yes) for the other files in the set:

xargs -r -d '\n' -a <(fdupes -r . | grep -vE "/.git(hub|lab)?/" | uniq) bash -c 'for file in "$@"; do if [ -z "$file" ]; then echo; else rm -vfi -- "$file"; fi; done' _

The grep part filters out git repositories and github/gitlab configs, a handy improvement, since we don't really want to remove things like HEAD inside the .git subdirectory and similar files.

Using uniq instead of grep -v '^$' removes only the duplicated lines, keeping one single blank line separating the sets.

DrBeco
  • 2,125
-1

I found a small program that really simplifies this kind of task: fdupes.