
I've inherited a research cluster with ~40TB of data across three filesystems. The data stretches back almost 15 years, and there are most likely a good number of duplicates, as researchers copy each other's data for different reasons and then just hang on to the copies.

I know about de-duping tools like fdupes and rmlint. I'm trying to find one that will work on such a large dataset. I don't care if it takes weeks (or maybe even months) to crawl all the data - I'll probably throttle it anyway to go easy on the filesystems. But I need a tool that's either somehow super efficient with RAM, or that can store all the intermediate data it needs in files rather than in RAM. I'm assuming my RAM (64GB) will be exhausted if I crawl through all this data as one set.

I'm experimenting with fdupes now on a 900GB tree. It's 25% of the way through, and RAM usage has been slowly creeping up the whole time; it's now at 700MB.

Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?

I'm running CentOS 6.

Kev

3 Answers


Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?

Yes, it's called swap space, and you probably already have some. If you're worried about running out of RAM, then increasing it is a good place to start. It works automatically, though, so there is no need to do anything special.

I would not worry about fdupes. Try it; it should work without problems.

krowe

Finding duplicates based on a hash key works well and is very fast. The pipeline below first collects the sizes of all non-empty files and keeps only the sizes that occur more than once, then MD5-hashes just the files with those sizes and prints the groups of matching hashes separated by blank lines.

find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
kumar

Write a quick app to walk the trees, either pushing (hash, mtime) => filepath into a dictionary, or marking the file for deletion if the entry already exists. The hash would just be an MD5 calculated over the first N bytes of each file. You might do a couple of different passes: one with a hash over a small N, and then another with a hash over a larger N.

You could probably do this in less than twenty or thirty lines of Python (using os.walk()).
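
Something along these lines would work as a starting point. This is an untested sketch, assuming the stock Python 2.6 on CentOS 6; CHUNK, partial_md5 and the output format are arbitrary choices of mine, and it only does the single small-N pass, printing collision candidates instead of deleting anything:

import hashlib
import os
import stat
import sys

CHUNK = 64 * 1024   # "small N": how many leading bytes of each file to hash

def partial_md5(path, n=CHUNK):
    # Fingerprint a file by MD5-ing only its first n bytes, so huge files stay cheap.
    h = hashlib.md5()
    f = open(path, 'rb')
    try:
        h.update(f.read(n))
    finally:
        f.close()
    return h.hexdigest()

seen = {}   # (hash, mtime) => first path seen with that key

for root in sys.argv[1:]:   # directories to walk, given on the command line
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                    continue   # skip symlinks, devices and empty files
                key = (partial_md5(path), int(st.st_mtime))
                if key in seen:
                    print("possible duplicate: %s == %s" % (path, seen[key]))
                else:
                    seen[key] = path
            except (OSError, IOError):
                continue   # unreadable or vanished file, move on

If the in-memory dictionary itself becomes a problem at 40TB worth of files, the same keys can be flattened to strings and kept in an on-disk store (for example the standard library's anydbm or shelve modules), which keeps RAM usage roughly flat at the cost of extra disk I/O - which addresses the "intermediary data in files rather than RAM" requirement in the question.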