I have many different disks (primarily hard disks) storing various files, and I want to know that they are all backed up in some form. Given that I somehow have terabytes of files (backups of backups, apparently), I don't want to just back everything up onto new media yet again. Instead, I'd like to maintain some form of database of files and use it to quickly and easily identify all files on disk X that do not yet exist on disk Y, so that only those need to be copied to Y. I'd also like to:
- list all files on X that are not duplicated/backed up on other media (roughly the query sketched after this list)
- deduplicate files on X
- list all files that are not duplicated onto offline/WORM/offsite storage
- ideally, also match JPGs by their EXIF date.
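
To make the first item concrete, this is roughly the query I imagine running, assuming a hypothetical SQLite database `hashes.db` with a table `files(disk, path, size, mtime, hash)` kept up to date by whatever indexing tool ends up being used (the database name, table, and columns are my own invention, not part of hashdeep or any other existing tool):

```python
import sqlite3

con = sqlite3.connect("hashes.db")  # hypothetical database built by an indexer

# Files on disk X whose content (identified by hash) exists on no other medium.
query = """
    SELECT x.path, x.size
    FROM files AS x
    WHERE x.disk = 'disk-X'
      AND NOT EXISTS (
            SELECT 1 FROM files AS y
            WHERE y.hash = x.hash AND y.disk <> 'disk-X')
    ORDER BY x.size DESC
"""
for path, size in con.execute(query):
    print(size, path)
```

The "not duplicated onto offline/WORM/offsite storage" variant would presumably be the same query with the inner disk filter restricted to whichever media count as offline.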
The first step towards this would be to maintain a database of the hashes of the files on all the hard disks. So, how would I maintain a database of hashes of many terabytes of files?
At first glance `hashdeep` would appear to be sufficient, but it does not seem to have any way of updating an existing database, so keeping the database current would require re-scanning many terabytes of files. `du -ab` is fast enough, and filename + file size gives a fairly good indication as to whether two files are duplicates; however, having hashes would clearly be much more reliable.
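
What I have in mind is something along these lines: a small script that walks each disk, skips files whose size and mtime are unchanged since the last scan, and only hashes new or modified files into SQLite. This is just a sketch under my own assumptions (the `hashes.db` file, the `files` table, and the per-disk label are inventions for illustration, not features of hashdeep):

```python
#!/usr/bin/env python3
"""Sketch of an incrementally updatable hash database.

Files whose size and mtime are unchanged since the last scan are not
re-hashed, so re-scanning a mostly unchanged disk is cheap.
"""
import hashlib
import os
import sqlite3
import sys

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    disk  TEXT NOT NULL,      -- label for the physical medium, e.g. 'disk-X'
    path  TEXT NOT NULL,      -- path relative to the scan root
    size  INTEGER NOT NULL,
    mtime INTEGER NOT NULL,
    hash  TEXT NOT NULL,      -- hex SHA-256 of the file contents
    PRIMARY KEY (disk, path)
)
"""

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(db, disk, root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                st = os.stat(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            row = db.execute(
                "SELECT size, mtime FROM files WHERE disk=? AND path=?",
                (disk, rel)).fetchone()
            # Skip re-hashing if size and mtime are unchanged since last scan.
            if row and row == (st.st_size, int(st.st_mtime)):
                continue
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?,?,?,?,?)",
                (disk, rel, st.st_size, int(st.st_mtime), sha256_of(full)))
    db.commit()

if __name__ == "__main__":
    disk_label, scan_root = sys.argv[1], sys.argv[2]
    con = sqlite3.connect("hashes.db")
    con.execute(SCHEMA)
    scan(con, disk_label, scan_root)
```

Re-running this against a disk that hasn't changed then only costs a directory walk and stat calls, which is the incremental-update behaviour I'm missing from hashdeep. But I'd rather not reinvent this if an existing tool already does it well.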