The Question
After a component device has suffered an unexpected bit-flip, and a raid6 repair has been performed successfully on the raid device, how can I force mdadm to sync the changes from its buffer(?) back onto the component device?
And how can I monitor when such a repair happens?
The Setup
For testing purposes I made the following setup (using bash on Debian jessie):
sudo -i
mkdir testbed
cd testbed
for i in 1 2 3 4; do
    dd if=/dev/zero of=disk$i bs=1M count=4
    # losetup expects the full loop device path
    losetup /dev/loop$i disk$i
done
mdadm --create /dev/md/test --level=6 --raid-devices=4 /dev/loop{1,2,3,4}
mkfs.vfat /dev/md/test # Note: has easier hexdump than ext
mkdir mounted
mount /dev/md/test mounted
echo "Hello World!" > mounted/message
The unexpected bit-flip
The test scenario assumes that some bit(s) on one of the component devices change while the raid device is not running.
umount mounted
mdadm --stop /dev/md/test
# Note: does show 'H' from 'Hello World!' at position 0x00107a00
hexdump -C /dev/loop1
# manipulate some bits in first component device at 0x00107a00
dd if=/dev/zero bs=1 count=1 seek=1079808 of=/dev/loop1
# Note: now changed to ".ello World!" at position 0x00107a00
hexdump -C /dev/loop1
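The offset arithmetic and the single-byte overwrite above can be rehearsed without root on a scratch file. This is only a sketch: the temporary file is a hypothetical stand-in for /dev/loop1, and `conv=notrunc` is added because a regular file, unlike a block device, would otherwise be truncated by dd.

```shell
# Sketch: reproduce the one-byte corruption on a scratch file (assumption: GNU dd).
offset=$((0x00107a00))          # decimal value passed to dd seek=
img=$(mktemp)                   # hypothetical stand-in for the component device
dd if=/dev/zero of="$img" bs=1M count=4 status=none
# Place the test message where hexdump showed it on the component device.
printf 'Hello World!' | dd of="$img" bs=1 seek="$offset" conv=notrunc status=none
# Overwrite exactly one byte with zero; notrunc keeps the rest of the file.
dd if=/dev/zero of="$img" bs=1 count=1 seek="$offset" conv=notrunc status=none
# Read back only the 12 affected bytes, showing NUL as '.' like hexdump does.
after=$(dd if="$img" bs=1 skip="$offset" count=12 status=none | tr '\0' '.')
echo "$offset $after"
rm -f "$img"
```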
The repair
Now start the raid device again, and try to persuade mdadm to detect and repair the faulty bits on the component device.
assembling and mounting
mdadm --assemble /dev/md/test /dev/loop{1,2,3,4}
mount /dev/md/test mounted
# dmesg does not show error
# hexdump still shows faulty bits
This is expected.
reading the faulty sectors
cat mounted/message # always reads the non-faulty message
# nothing in dmesg
# no raid6 related message in /var/log/syslog
# /sys/block/md127/md/mismatch_cnt == 0
# hexdump still shows faulty bits
By now I would have expected mdadm to have detected the parity mismatch and to have determined by majority vote that /dev/loop1 is faulty. But no warning or error count about this shows up anywhere.
initiate repair
echo repair > /sys/block/md127/md/sync_action
sync # should be completely unrelated for this question
# dmesg reports successful resync
# /var/log/syslog replicates the dmesg messages
# hexdump -C /dev/loop1 still shows faulty bits at 0x00107a00
mdadm must surely have noticed the faulty bits by now, but for some reason it did not write the repaired chunk back to disk.
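For the monitoring half of the question, one thing to try is waiting until the pass has finished and only then reading the counter. The sketch below assumes that sync_action reads "idle" once a check or repair pass is over and that mismatch_cnt is updated by such a pass. `mismatch_report` is a hypothetical helper name, not part of mdadm.

```shell
# Hypothetical helper: print the mismatch counter once no sync action is running.
# $1 = md sysfs directory, for example /sys/block/md127/md
mismatch_report() {
    md_dir=$1
    # Poll until the kernel reports that no check/repair/resync is in progress.
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep 1
    done
    echo "mismatch_cnt=$(cat "$md_dir/mismatch_cnt")"
}

# Usage on the test array (as root), with a read-only check instead of a repair:
#   echo check > /sys/block/md127/md/sync_action
#   mismatch_report /sys/block/md127/md
```

Reading the counter only after the array is idle avoids sampling it while a pass is still running.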
need to stop raid device
It seems necessary to stop the raid device (thus making the file system temporarily unavailable!) in order to force the repaired chunk to be synced.
umount mounted
mdadm --stop /dev/md/test
Finally, hexdump shows the correct 'H' again. But there is no indication of a faulty chunk in dmesg, syslog, or mismatch_cnt.
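Since nothing in the logs points at the affected location, a workaround in a test setup like this is to snapshot the backing file and compare it afterwards. The sketch below runs on scratch files only; the directory and the file names `disk1` and `disk1.before` are stand-ins for the testbed, not something mdadm provides.

```shell
# Sketch: locate changed bytes by comparing a snapshot with the current backing file.
work=$(mktemp -d)                      # scratch directory, stand-in for testbed/
dd if=/dev/zero of="$work/disk1" bs=1M count=4 status=none
printf 'Hello World!' | dd of="$work/disk1" bs=1 seek=1079808 conv=notrunc status=none
cp "$work/disk1" "$work/disk1.before"  # snapshot taken before the bit-flip
dd if=/dev/zero of="$work/disk1" bs=1 count=1 seek=1079808 conv=notrunc status=none
# cmp -l prints 1-based byte numbers and octal byte values;
# awk converts the position to a 0-based hex offset as shown by hexdump.
changed=$(cmp -l "$work/disk1.before" "$work/disk1" |
    awk '{ printf "0x%08x %s %s\n", $1 - 1, $2, $3 }')
echo "$changed"
rm -rf "$work"
```

The same comparison could be run between a snapshot and the backing file after the repair, to see whether and when the byte is restored.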