
I have a NAS box running some version of Linux that I use for backing up anything and everything.

It is essentially an absolute certainty that some of the files are identical duplicates.

That being the case, what I want to do is:

  1. Identify duplicate files where "duplicate" = identical SHA256 checksums. (Identical SHA512 is also acceptable but might take much longer. Which do you suggest?)
  2. Allow one to be the "master" copy, remove all other copies, and substitute hard links to the one remaining copy. This should free up a considerable amount of space on the NAS volume.

Note that the first file found is a good choice for the "master" file and all others can be removed and hard-linked to it. Permissions and ownership aren't a problem because there's only one user and (don't hate on me here), it's all wide open permission-wise anyway.

Also note that I want hard links so that if I delete a file (for whatever reason), all the others remain.

Note that I have console access to the NAS box through an SSH shell.

Question:

  1. Is it possible?
  2. How do I do it?
  3. If a file has "X" number of hard-links to it, and I delete the original file that everyone else is hard linked to, do the remaining hard links remain to a real file? (I suspect the answer is "yes" such that the one file remains until all hard-links are removed.)

Update to add additional context:

The NAS box has two drives, one of which is external, can be removed, and can be processed on my Ubuntu laptop.

The other one is internal and is essentially untouchable. Though I can remove it, it is set up in a very custom way and dinking with it is the fast boat to disaster.

Additionally the internal O/S is a "network appliance" version of Linux running BusyBox. It appears to implement the standard functions including things like find, grep, sed, awk, etc.

Viz.: (as returned by busybox --help)

Currently defined functions:
        adjtimex, ar, arp, arping, ash, awk, basename, cat, chgrp, chmod,
        chown, chroot, clear, cmp, cp, crond, crontab, cut, date, dd, df,
        dhcprelay, diff, dirname, dmesg, dnsdomainname, dos2unix, dpkg,
        dumpleases, echo, egrep, env, expand, expr, false, fgrep, find, free,
        fsck, getopt, getty, grep, halt, head, hostname, id, ifconfig,
        ifenslave, init, ionice, ipcalc, kill, killall, ln, logger, logname,
        logread, losetup, ls, lsof, lspci, lsusb, md5sum, mdev, mkdir, mkfifo,
        mknod, mkswap, mktemp, modprobe, more, mount, mv, nice, nohup,
        nslookup, pidof, ping, ping6, pivot_root, poweroff, printenv, printf,
        ps, pwd, rdate, readlink, reboot, renice, rm, rmdir, route, sed, seq,
        sh, sleep, sort, split, stat, swapoff, swapon, sync, sysctl, syslogd,
        tac, tail, tar, tee, tftp, tftpd, top, touch, tr, traceroute,
        traceroute6, true, tty, udhcpc, udhcpd, umount, uname, uniq, unix2dos,
        uptime, usleep, vconfig, vi, watch, wc, which, xargs, zcip


1. Is it possible?

Yes, but not with sha256sum (or sha512sum), because your BusyBox does not seem to provide either of them. You do have md5sum.
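You can confirm which checksum applets your build actually provides before settling on one (a quick check; `command -v` works in BusyBox ash as well as other POSIX shells):

```shell
# Print only the checksum tools that exist on this system.
for tool in sha512sum sha256sum sha1sum md5sum; do
    command -v "$tool" >/dev/null 2>&1 && echo "$tool"
done
```

On your NAS, per the applet list above, only md5sum should appear.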


2. How do I do it?

Use the script below. Save it to a file (e.g. deduplicate) and make it executable (chmod +x deduplicate). The script needs proper input and an argument to do its job right, so read the whole answer before you run it.

Script

#!/bin/sh

exec 3>/dev/tty

# head -c is not portable; fall back to dd if this head doesn't support it.
if echo | head -c 1 >/dev/null 2>/dev/null; then
  fhead () { head -c 8192; }
else
  fhead () { dd bs=1 count=8192 2>/dev/null; }
fi

grep -v '~DEDUPED~$' \
| ( while IFS= read -r pathname; do
      wc -c -- "$pathname"
      printf . >&3
    done ) \
| sort -n \
| ( sequence= oldsize= oldpathname=
    while IFS= read -r line; do
      size="${line%% *}"
      pathname="${line#* }"
      if [ "$size" = "$oldsize" ]; then
        if [ -z "$sequence" ]; then
          sequence=y
          printf '%s %s\n' "$(<"$oldpathname" fhead | md5sum -b)" "$oldpathname"
          printf '+' >&3
        fi
        printf '%s %s\n' "$(<"$pathname" fhead | md5sum -b)" "$pathname"
        printf '+' >&3
      else
        oldsize="$size"
        oldpathname="$pathname"
        sequence=
        printf '-' >&3
      fi
    done ) \
| sort -k 1,1 \
| ( sequence= oldsum= oldpathname=
    while IFS= read -r line; do
      sum="${line%% *}"
      pathname="${line#* ?- }"
      if [ "$sum" = "$oldsum" ]; then
        if [ -z "$sequence" ]; then
          sequence=y
          md5sum -b -- "$oldpathname"
          printf '#' >&3
        fi
        md5sum -b -- "$pathname"
        printf '#' >&3
      else
        oldsum="$sum"
        oldpathname="$pathname"
        sequence=
        printf '=' >&3
      fi
    done
    echo >&3 ) \
| sort -k 1,1 \
| ( oldsum= oldpathname=
    while IFS= read -r line; do
      sum="${line%% *}"
      pathname="${line#* ?}"
      if [ "$sum" = "$oldsum" ]; then
        if [ "$1" = --no-dry-run ]; then
          mv -- "$pathname" "$pathname~DEDUPED~" \
          && ln -f -- "$oldpathname" "$pathname" \
          && printf '%s ----> %s\n' "$pathname" "$oldpathname"
        else
          printf '%s ----> %s\n' "$pathname" "$oldpathname"
        fi
      else
        oldsum="$sum"
        oldpathname="$pathname"
      fi
    done )

Usage

The script is designed to read pathnames from find. Caveats:

  • Because hardlinking only works within a single filesystem, it's advised to run the script for one filesystem (or a part of it) at a time.
  • You should give the script only pathnames to regular files.
  • Deduplicating empty files will not give you much space (if anything), so you can exclude them.
  • Since I don't know if your toolset supports working with null-terminated strings, everything is designed to work with newline-terminated strings. It's your responsibility not to give the script any pathname that contains newline character(s).

All this means you should run a command like:

find /the/mountpoint \
     -xdev \
     -type f \
     -size +0 \
   ! -name "$(printf '*\n*')" \
| ./deduplicate

Note the script invoked like this will only print what it would do (and the form may be ambiguous). To make the script actually affect the filesystem (deduplicate), you need to pass --no-dry-run as the first argument (find … | ./deduplicate --no-dry-run).

Procedure

The script first uses wc -c to collect file sizes. For each file with a non-unique size it computes the md5sum of the first 8192 bytes (using head -c if supported, falling back to dd otherwise). For each file with a non-unique partial sum it computes the md5sum of the whole file. Files with identical results are considered duplicates: the first one in a bunch is kept intact, the others get replaced by hardlinks to it (if --no-dry-run). Note "the first one in a bunch" is not necessarily "the first file found".
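The parameter expansions in the script depend on the exact output formats of wc -c ("size pathname") and md5sum -b ("sum *pathname", or "sum *-" when the data comes from stdin). A quick sketch with a hypothetical file under /tmp:

```shell
# Hypothetical demo file (6 bytes).
printf 'hello\n' > /tmp/demo_a

# Stage 1 parses `wc -c` output: "<size> <pathname>".
wc -c -- /tmp/demo_a

# Stages 2 and 3 parse `md5sum -b` output: "<sum> *<pathname>",
# or "<sum> *-" when hashing stdin; that is why the script strips
# a "* " (or "*- ") prefix to recover the pathname.
md5sum -b -- /tmp/demo_a
head -c 8192 /tmp/demo_a | md5sum -b
```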

The script is quite dumb: it does not check whether some of the pathnames it is given already lead to the same file (inode); it does not even check for identical pathnames. E.g. if you use find /the/mountpoint /the/mountpoint … | ./deduplicate then the script will get each pathname twice, which will result in a lot of wasted work.
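If you want to check that by hand, your applet list includes stat, so you can compare inode numbers of two pathnames yourself (a sketch with hypothetical paths; stat -c %i prints the inode number):

```shell
# Succeeds iff the two pathnames already refer to the same inode.
same_file () {
    [ "$(stat -c %i "$1")" = "$(stat -c %i "$2")" ]
}

# Demonstration with hypothetical paths:
printf x > /tmp/orig
rm -f /tmp/alias
ln /tmp/orig /tmp/alias
same_file /tmp/orig /tmp/alias && echo "already the same file"
```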

Portability

head -c is not portable, but if it's not supported then the script will fall back to dd automatically.

md5sum is not portable but I know your busybox provides it. I used md5sum -b to explicitly request the binary mode; it probably won't matter. If your md5sum does not understand -b then you can omit this option.

False positives

Files with identical final sums are considered duplicates and the final sums are the only criterion; even the sizes and partial sums don't matter at this stage (these are only to rule out obviously unique files early). The script does not run cmp to make sure ln is going to act on truly identical files.
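If you want that extra safety, a cmp guard could be wrapped around the replacement step in the last stage (a sketch, not part of the script above; cmp is in your applet list and cmp -s exits zero only when the files are byte-identical; the two demo pathnames are hypothetical stand-ins for the script's $oldpathname and $pathname):

```shell
# Hypothetical demo files standing in for $oldpathname / $pathname.
oldpathname=/tmp/demo_master
pathname=/tmp/demo_dupe
printf 'same content\n' > "$oldpathname"
printf 'same content\n' > "$pathname"

# Only replace with a hardlink when cmp confirms the contents match.
if cmp -s -- "$oldpathname" "$pathname"; then
    mv -- "$pathname" "$pathname~DEDUPED~" \
    && ln -f -- "$oldpathname" "$pathname" \
    && printf '%s ----> %s\n' "$pathname" "$oldpathname"
else
    printf 'not identical, skipping: %s\n' "$pathname" >&2
fi
```

The cost is reading both files in full once more per duplicate pair.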

For this reason I introduced mv -- "$pathname" "$pathname~DEDUPED~". This command is responsible for renaming files that otherwise would be replaced by hardlinks, thus "freeing" pathnames and backing up old content. Later you can find all such backup files by invoking

find /the/mountpoint -name '*~DEDUPED~'

review their replacements and revert if necessary. Note the script filters its input with grep -v '~DEDUPED~$', so you can run it again and it won't try to deduplicate these backup files.
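Reverting a single file then amounts to unlinking the new hardlink and putting the backup name back. A sketch in a scratch directory; in real use, f would be a pathname under /the/mountpoint that the script deduplicated:

```shell
cd "$(mktemp -d)"
printf 'master\n'   > keep
printf 'original\n' > 'f~DEDUPED~'   # the backup the script left behind
ln keep f                            # f is currently a hardlink to keep

rm -- f                              # drop the hardlink
mv -- 'f~DEDUPED~' f                 # restore the original content
cat f                                # prints: original
```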

Add -exec rm {} + -print to the end of the above command to remove all the backup files in bulk, if this is what you want.

Ownership and mode

The script does not care about ownership or mode (permissions). Hardlinks are just different pathnames for a single file, nothing more, so if ln succeeds then the ownership and mode will be as "shared" as the content. Ownership and/or permissions may cause mv or ln to fail. In a comment you stated that this is not a problem for you.

Other users be warned though.

Progress indicator

The script prints progress indicator to /dev/tty. At different stages it consists of . ("sizes" stage), - and + ("partial md5sum" stage) or = and # ("whole md5sum" stage) characters. This is only to show the script is progressing.

The indicator is far from perfect. Error messages (if any) may appear inside it.

The easiest way to get rid of the indicator is to change exec 3>/dev/tty to exec 3>/dev/null near the beginning of the script.


3. If a file has "X" number of hard-links to it, and I delete the original file that everyone else is hard linked to, do the remaining hard links remain to a real file?

Yes. In fact you do not delete "the original file"; you just unlink a name. A new hardlink is just an alternative pathname to the real file. The old pathname is (and was!) just as much a hardlink as the new one. We usually don't call a pathname a "hardlink" when it's the only pathname linked to the file; we start doing so when there are two or more, but then each one is equal, and there is no "original" we can distinguish. Only after all names are unlinked (and all open handles closed) will the filesystem remove the file. Also see this answer: Use cases for hardlinks.
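You can watch the link count with stat (which your applet list includes; -c %h prints the number of hard links) in a small demonstration:

```shell
cd "$(mktemp -d)"
printf 'payload\n' > a
ln a b
ln a c                 # three names, one file
stat -c %h a           # prints: 3
rm a                   # unlink the "original" name
stat -c %h b           # prints: 2
cat b                  # prints: payload  (content intact)
```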