
The Question

After a component device has suffered an unexpected bit-flip, and after a raid6 repair has been successfully performed on the raid device, how can I force mdadm to sync the changes from the buffer(?) back onto the component device?

And how can I monitor when such a repair happens?

The Setup

For testing purposes I made the following setup (using bash on Debian jessie):

sudo -i
mkdir testbed
cd testbed
for i in 1 2 3 4; do
    dd if=/dev/zero of=disk$i bs=1M count=4
    losetup /dev/loop$i disk$i
done
mdadm --create /dev/md/test --level=6 --raid-devices=4 /dev/loop{1,2,3,4}
mkfs.vfat /dev/md/test # Note: has easier hexdump than ext
mkdir mounted
mount /dev/md/test mounted
echo "Hello World!" > mounted/message

The unexpected bit-flip

The test scenario assumes that some bit(s) on one of the component devices change while the raid device is not running.

umount mounted
mdadm --stop /dev/md/test
# Note: does show 'H' from 'Hello World!' at position 0x00107a00
hexdump -C /dev/loop1
# manipulate some bits in first component device at 0x00107a00
dd if=/dev/zero bs=1 count=1 seek=1079808 of=/dev/loop1
# Note: now changed to ".ello World!" at position 0x00107a00
hexdump -C /dev/loop1
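
To avoid scanning a full hexdump each time, the single byte at the manipulated offset can be read on its own. This is only a minimal sketch: `peek_byte` is a hypothetical helper name, not part of mdadm or any other tool, and it reads through the normal buffered path just like hexdump does.

```shell
# peek_byte FILE OFFSET
# Print the byte at OFFSET (decimal or 0x-prefixed hex) as two hex digits.
# Hypothetical helper for illustration only.
peek_byte() {
    file=$1
    offset=$(( $2 ))   # shell arithmetic converts 0x... to decimal
    # Read exactly one byte at the offset and format it as hex.
    dd if="$file" bs=1 skip="$offset" count=1 2>/dev/null | od -An -tx1 | tr -d ' \n'
    echo
}
```

In the scenario above, `peek_byte /dev/loop1 0x00107a00` would be expected to print 48 (the 'H') before the dd and 00 afterwards.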

The repair

Now start the raid device again, and try to persuade mdadm to detect and repair the faulty bits on the component device.

assembling and mounting

mdadm --assemble /dev/md/test /dev/loop{1,2,3,4}
mount /dev/md/test mounted
# dmesg does not show error
# hexdump still shows faulty bits

This is expected.

reading the faulty sectors

cat mounted/message # always reads the non-faulty message
# nothing in dmesg
# no raid6 related message in /var/log/syslog
# /sys/block/md127/md/mismatch_cnt == 0
# hexdump still shows faulty bits

By now, mdadm should have detected the mismatching checksum and, by majority vote, determined that /dev/loop1 is faulty. But there is no warning or error count about this anywhere.

initiate repair

echo repair > /sys/block/md127/md/sync_action
sync # should be completely unrelated to this question
# dmesg reports successful resync
# /var/log/syslog replicates the dmesg messages
# hexdump -C /dev/loop1 still shows faulty bits at 0x00107a00

mdadm must surely have noticed the faulty bits by now, but for some reason it did not write the repaired chunk back to disk.
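
For reference, this is roughly how the sysfs files used here can be polled so that mismatch_cnt is only read after the scrub has really finished. It is a sketch under assumptions: `md_check_report` is a hypothetical helper name, and it relies on sync_action reading "idle" once no check, repair, or resync is running.

```shell
# md_check_report MD_SYSFS_DIR [POLL_SECONDS]
# Wait until sync_action reports "idle", then print mismatch_cnt.
# Hypothetical helper for illustration only.
md_check_report() {
    md_dir=$1
    poll=${2:-1}
    # Poll until the running check/repair/resync has completed.
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$poll"
    done
    echo "mismatch_cnt=$(cat "$md_dir/mismatch_cnt")"
}
```

A possible use would be `echo check > /sys/block/md127/md/sync_action` followed by `md_check_report /sys/block/md127/md`. Writing check instead of repair should only count mismatches without rewriting anything, which may make it easier to see whether the flipped byte is detected at all.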

need to stop raid device

It seems necessary to stop the raid device (thus making the file system temporarily unavailable!) in order to force a sync of the repaired chunk.

umount mounted
mdadm --stop /dev/md/test

Finally, hexdump shows the correct 'H' again, but there is no indication of the faulty chunk in dmesg, syslog, or mismatch_cnt.
