In recent weeks I have had several cases of data corruption that occurred while copying files from one disk to another. The question is: what can be the cause, and how do I pinpoint it?
Some clues:
- the problems (9 cases) occurred on two different machines (one an AMD 5050e with ECC RAM, the other a netbook), both running Win7 x64 SP1 with no crash or other apparent problem;
- the problems occurred while copying large amounts of data (about 3 TB in total) from one disk to another;
- the copies were made with the standard GUI (Windows Explorer), which reported no error;
- original file and copy have the same size and modification date;
- data corruption was detected using an MD5 hash (md5sum and/or Microsoft's FCIV), which was wrong on the copy (the MD5 of both the original and the copy is repeatable); fc /B repeatably reports the differences, which have always been in contiguous blocks aligned on exact 4 kiB boundaries (10 cases: one file was hit twice);
- the blocks in error are of varying size, from 4 kiB to 52 kiB, at seemingly random locations, in large files (typically several GB);
- corrupted blocks show no apparent relation to the original; in about half the cases, the corrupted data was all-zero;
- all disks involved are NTFS, and given a clean bill of health by chkdsk /f (no bad block, no error reported);
- the two affected destination disks are USB (the HD happens to be from the same manufacturer, but I can't say this is significant):
  - one is a 2.5" 2 TB housed in a self-powered USB 3 (Super-Speed, used in Hi-Speed) enclosure bearing the HD manufacturer's brand;
  - one is a 3.5" 1.5 TB in a Linux-based multimedia enclosure (PCH A-200) with USB 2 (Hi-Speed) slave port;
- in more than half the cases the corruption was detected about an hour after the copy, with no disconnection or reboot involved; in most or all of the others the destination disks had been properly ejected;
- I have no reason to suspect the various source disks (mostly SATA, some SSD).
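
For anyone wanting to reproduce the kind of report I got from fc /B, the comparison can be sketched as below. This is only a minimal sketch, assuming Python 3 is available on the machine; `diff_runs` is a name made up for illustration, and the 4096-byte block size is simply the alignment observed above, not something the tool discovers by itself.

```python
import sys

BLOCK = 4096  # 4 kiB, the alignment observed in the corrupted copies (assumption)


def diff_runs(src_path, dst_path, block=BLOCK):
    """Return (offset, length, all_zero) for each contiguous run of differing blocks.

    all_zero is True when every differing block of the copy holds only zero bytes.
    """
    runs = []
    start = None   # offset where the current run of bad blocks began
    zero = True    # whether the copy's data in the current run is all-zero
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while True:
            a = src.read(block)
            b = dst.read(block)
            if not a and not b:
                break
            if a != b:
                if start is None:
                    start, zero = offset, True
                zero = zero and not any(b)
            elif start is not None:
                # A matching block closes the current run of bad blocks.
                runs.append((start, offset - start, zero))
                start = None
            offset += max(len(a), len(b))
    if start is not None:
        runs.append((start, offset - start, zero))
    return runs


# Hypothetical usage: python blockdiff.py original.bin copy.bin
if __name__ == "__main__" and len(sys.argv) == 3:
    for off, length, zero in diff_runs(sys.argv[1], sys.argv[2]):
        note = " (copy is all-zero)" if zero else ""
        print(f"offset {off:#x}: {length // 1024} kiB differ{note}")
```

Because it compares whole 4 kiB blocks, a single differing byte marks its whole block as bad; it shows where the damage is and whether it is all-zero, but it cannot confirm the 4 kiB alignment itself (fc /B is needed for that).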
Addition: my real concern is finding the root cause and pinpointing the culprit(s), more than working around the issue.
I reason that all the technologies involved are supposed to have a very low rate of undetected errors compared to reported ones (and I have no report of any error). Therefore:
- if the error trigger was the magnetic media (a hypothesis that very well matches the observed 4 kiB alignment, which I believe matches the internal physical sector size of the disks), it is compounded by a disastrous bug somewhere that prevents the error from being reported, as it would be (I know from experience) for at least a read error on a SATA disk of my (different) favorite brand;
- if the error trigger was poor electrical contact in the USB cabling, undetected by the CRC (as suggested by an answer), then given that the USB 2 maximum data packet size is 1 kiB according to this source, not 4 kiB like the alignment of all my errors, there must be some additional bug in the handling of errors (or a gaping hole in the USB specs or in how they handle hard disks).
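
To put a rough number on why the 4 kiB alignment does not look like packet-level damage, here is a back-of-the-envelope sketch. It rests on assumptions that are mine and are a simplification: that a per-packet error would start and end on an arbitrary 1 kiB boundary (the packet size quoted above), and that the boundaries of the 10 corrupted runs are independent. `chance_all_aligned` is a made-up helper name.

```python
from fractions import Fraction


def chance_all_aligned(n_boundaries, unit=1024, alignment=4096):
    """Probability that n independent boundaries, each uniform over multiples
    of `unit`, all happen to fall on multiples of `alignment`.

    Assumes `alignment` is a multiple of `unit`.
    """
    return Fraction(unit, alignment) ** n_boundaries


# 10 corrupted runs, each with a start and an end boundary.
p = chance_all_aligned(2 * 10)
print(p, float(p))
```

Under those assumptions the chance is (1/4)^20, about 1 in 10^12, so whatever corrupts the data appears to operate in 4 kiB units rather than in individual USB packets.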