I have many different disks (primarily hard disks) storing various files, and I want to know that they are all backed up in some form. Given that I somehow have terabytes of files (backups of backups, apparently), I don't want to just back everything up onto new media yet again. Instead, I'd like to maintain some form of database of files and use it to quickly and easily identify all files on disk X that do not yet exist on disk Y, so that only those need to be copied to Y. I'd also like to:
- list all files on X that are not duplicated/backed up on other media (roughly the query sketched after this list)
- deduplicate files on X
- list all files that are not duplicated onto offline/WORM/offsite storage
- ideally, also match JPGs by their EXIF date.
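
To make the first item concrete, this is roughly the query I imagine running, assuming a hypothetical SQLite database `hashes.db` with a table `files(disk, path, size, mtime, hash)` kept up to date by whatever indexing tool ends up being used (the database name, table, and columns are my own invention, not part of hashdeep or any other existing tool):

```python
import sqlite3

con = sqlite3.connect("hashes.db")  # hypothetical database built by an indexer

# Files on disk X whose content (identified by hash) exists on no other medium.
query = """
    SELECT x.path, x.size
    FROM files AS x
    WHERE x.disk = 'disk-X'
      AND NOT EXISTS (
            SELECT 1 FROM files AS y
            WHERE y.hash = x.hash AND y.disk <> 'disk-X')
    ORDER BY x.size DESC
"""
for path, size in con.execute(query):
    print(size, path)
```

The "not duplicated onto offline/WORM/offsite storage" variant would presumably be the same query with the inner disk filter restricted to whichever media count as offline.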
The first step towards this would be to maintain a database of the hashes of the files on all the hard disks. So, how would I maintain a database of hashes of many terabytes of files?
At first glance `hashdeep` would appear to be sufficient, but it does not seem to have any way of updating an existing database, so keeping the database current would require re-scanning many terabytes of files. `du -ab` is fast enough, and filename + file size gives a fairly good indication as to whether two files are duplicates; however, having hashes would clearly be much more reliable.
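
What I have in mind is something along these lines: a small script that walks each disk, skips files whose size and mtime are unchanged since the last scan, and only hashes new or modified files into SQLite. This is just a sketch under my own assumptions (the `hashes.db` file, the `files` table, and the per-disk label are inventions for illustration, not features of hashdeep):

```python
#!/usr/bin/env python3
"""Sketch of an incrementally updatable hash database.

Files whose size and mtime are unchanged since the last scan are not
re-hashed, so re-scanning a mostly unchanged disk is cheap.
"""
import hashlib
import os
import sqlite3
import sys

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    disk  TEXT NOT NULL,      -- label for the physical medium, e.g. 'disk-X'
    path  TEXT NOT NULL,      -- path relative to the scan root
    size  INTEGER NOT NULL,
    mtime INTEGER NOT NULL,
    hash  TEXT NOT NULL,      -- hex SHA-256 of the file contents
    PRIMARY KEY (disk, path)
)
"""

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(db, disk, root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                st = os.stat(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            row = db.execute(
                "SELECT size, mtime FROM files WHERE disk=? AND path=?",
                (disk, rel)).fetchone()
            # Skip re-hashing if size and mtime are unchanged since last scan.
            if row and row == (st.st_size, int(st.st_mtime)):
                continue
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?,?,?,?,?)",
                (disk, rel, st.st_size, int(st.st_mtime), sha256_of(full)))
    db.commit()

if __name__ == "__main__":
    disk_label, scan_root = sys.argv[1], sys.argv[2]
    con = sqlite3.connect("hashes.db")
    con.execute(SCHEMA)
    scan(con, disk_label, scan_root)
```

Re-running this against a disk that hasn't changed then only costs a directory walk and stat calls, which is the incremental-update behaviour I'm missing from hashdeep. But I'd rather not reinvent this if an existing tool already does it well.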