
I want to build a NAS using mdadm for the RAID and btrfs for the bitrot detection. I have a fairly basic setup: three 1 TB disks combined with mdadm into a RAID5, then btrfs on top of that.

I know that mdadm cannot repair bitrot. It can only tell me when there are mismatches but it doesn't know which data is correct and which is faulty. When I tell mdadm to repair my md0 after I simulate bitrot, it always rebuilds the parity. Btrfs uses checksums so it knows which data is faulty, but it cannot repair the data since it cannot see the parity.

I can, however, run a btrfs scrub and read the syslog to get the offset of the data that did not match its checksum. I can then translate this offset to a disk and an offset on that disk, because I know the data start offset of md0 (2048 * 512), the chunk size (512K) and the layout (left-symmetric). The layout means that in my first layer the parity is on the third disk, in the second layer on the second disk, and in the third layer on the first disk.
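For anyone wanting to follow along, here is a rough sketch of that translation in shell arithmetic, under the assumptions above (3 members, 512K chunks, left-symmetric layout, 2048-sector data offset). The variable names are mine, disks are counted from 0, and the constants should really be taken from mdadm --detail / --examine for the actual array:

#!/bin/bash
# Sketch only: map a byte offset inside /dev/md0 to (member index, byte offset on that member).
ARRAY_OFF=$1                     # byte offset relative to the start of md0 (e.g. from the scrub error)
CHUNK=$((512 * 1024))            # chunk size in bytes
NDISKS=3
DATA_DISKS=$((NDISKS - 1))
DATA_OFFSET=$((2048 * 512))      # where the array data starts on each member device

chunk_nr=$((ARRAY_OFF / CHUNK))  # logical data chunk number
in_chunk=$((ARRAY_OFF % CHUNK))  # offset within that chunk
stripe=$((chunk_nr / DATA_DISKS))
pos=$((chunk_nr % DATA_DISKS))   # position of the data chunk within its stripe

parity=$(( (NDISKS - 1) - (stripe % NDISKS) ))  # "left": parity walks from the last disk to the first
disk=$(( (parity + 1 + pos) % NDISKS ))         # "symmetric": data starts on the disk after the parity
member_off=$((DATA_OFFSET + stripe * CHUNK + in_chunk))

echo "data chunk on member $disk at byte $member_off (parity of this stripe on member $parity)"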

Combining all this with some more knowledge of the btrfs on-disk format, I can calculate exactly which chunk of which disk is the faulty one. However, I cannot find a way to tell mdadm to repair this specific chunk.

I already wrote a script which swaps the parity and the faulty chunk using dd, then starts a repair with mdadm and then swaps them back (roughly along the lines of the sketch below), but this is not a good solution; I would really want mdadm to mark this sector as bad and not use it again. Since it has started to rot, chances are high it will do so again.
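The core of that workaround might look something like this. This is an untested sketch of the idea rather than the actual script; $DATA_DEV/$DATA_OFF and $PARITY_DEV/$PARITY_OFF are the member devices and byte offsets produced by the calculation above, and the array is stopped whenever the members are written to directly:

CHUNK=$((512 * 1024))
SECTORS=$((CHUNK / 512))

mdadm --stop /dev/md0                        # never touch members of a running array

# save both chunks, then write them back swapped
dd if="$DATA_DEV"   of=/tmp/data.chunk   bs=512 skip=$((DATA_OFF   / 512)) count=$SECTORS
dd if="$PARITY_DEV" of=/tmp/parity.chunk bs=512 skip=$((PARITY_OFF / 512)) count=$SECTORS
dd if=/tmp/parity.chunk of="$DATA_DEV"   bs=512 seek=$((DATA_OFF   / 512)) conv=notrunc
dd if=/tmp/data.chunk   of="$PARITY_DEV" bs=512 seek=$((PARITY_OFF / 512)) conv=notrunc

mdadm --assemble /dev/md0 /dev/sd[abc]       # member names are just an example
echo repair > /sys/block/md0/md/sync_action  # md rewrites the parity slot, which now yields the original data
while [ "$(cat /sys/block/md0/md/sync_action)" != idle ]; do sleep 5; done

# finally repeat the stop / four dd commands / assemble to swap the two chunks back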

My question is: is there any way to tell mdadm to repair a single chunk (which is not the parity) and possibly even mark a disk sector as bad? Maybe by creating a read I/O error?

(And I know ZFS can do all this by itself, but I don't want to use ECC memory.)

Edit: that question/answer is about how Btrfs RAID6 is unstable and how ZFS is much more stable/usable. It does not address my question about how to repair a single known-faulty chunk with mdadm.

2 Answers


I cannot find a way to tell mdadm to repair this specific chunk.

That's because when there is silent data corruption, md does not have enough information to know which block is silently corrupted.

I invite you to read my answer to question #4 ("Why does md continue to use a device with invalid data?") here, which explains this in further detail.

To make matters worse for your proposed layout, if a parity block suffers silent data corruption, the Btrfs layer above cannot see it at all! When the disk holding the corresponding data block fails and you try to replace it, md will reconstruct from the corrupted parity and irreversibly corrupt your data. Btrfs will only notice the corruption when it reads the reconstructed data, and by then the data is already lost.

This is because md does not read from parity blocks unless the array is degraded.


So is there any way to tell mdadm to repair a single chunk (which is not the parity) and possibly even mark a disk sector as bad? Maybe creating a read io error?

md copes easily with bad sectors that the hard drive itself has detected, because the drive reports them to md as read errors, so md knows exactly which block to reconstruct.

You can technically make a bad sector with hdparm --make-bad-sector, but how do you know which disk has the block affected by silent data corruption?
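(For what it's worth, the call would look something like the line below; the sector number and device are placeholders, and hdparm refuses to do it without the extra confirmation flag.)

# overwrite LBA 123456 with bad ECC so that reads of it fail until it is rewritten
hdparm --make-bad-sector 123456 --yes-i-know-what-i-am-doing /dev/sdX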

Consider this simplified example:

Parity formula: PARITY = DATA_1 + DATA_2

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      1 |      1 |      2 | # OK
+--------+--------+--------+

Now let's silently corrupt one block at a time with a value of 3:

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      3 |      1 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      3 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      1 |      3 | # Integrity failed – Expected: PARITY = 2
+--------+--------+--------+

If you didn't have the first table to look at, how would you know which block was corrupted?
You can't know for sure.
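To make that concrete, here is a toy check of the first corrupted row, using the same PARITY = DATA_1 + DATA_2 convention as above (real RAID 5 uses XOR, but the ambiguity is exactly the same):

d1=3; d2=1; p=2   # first corrupted row: DATA_1 was silently changed from 1 to 3
echo "assume DATA_1 lies: rebuild it as PARITY - DATA_2 = $((p - d2))"   # 1, the true value, but md cannot know this guess is right
echo "assume DATA_2 lies: rebuild it as PARITY - DATA_1 = $((p - d1))"   # -1, wrong, yet just as consistent with the parity
echo "assume PARITY lies: rebuild it as DATA_1 + DATA_2 = $((d1 + d2))"  # 4, which is what an md repair actually writes

All three "repairs" make the stripe consistent again, so the parity alone cannot tell you which one is correct.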

This is why Btrfs and ZFS both checksum blocks. It takes a little more disk space, but this extra information lets the storage system figure out which block is lying.

From Jeff Bonwick's blog article "RAID-Z":

Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data.

To do this with Btrfs on md, you would have to rebuild each data block in turn from the remaining blocks plus the parity until the Btrfs checksum matches, a time-consuming process with no easy interface exposed to the user/script.


I know ZFS can do all this all by itself, but I don't want to use ECC memory

Neither ZFS nor Btrfs over md depends on or is even aware of ECC memory. ECC memory only catches silent data corruption in memory, so it's storage system-agnostic.

I've recommended ZFS over Btrfs for RAID-5 and RAID-6 (analogous to ZFS RAID-Z and RAID-Z2, respectively) before in Btrfs over mdadm raid6? and Fail device in md RAID when ATA stops responding, but I would like to take this opportunity to outline a few more advantages of ZFS:

  • When ZFS detects silent data corruption, it is automatically and immediately corrected on the spot without any human intervention.
  • If you need to rebuild an entire disk, ZFS will only "resilver" the actual data instead of needlessly running across the whole block device.
  • ZFS is an all-in-one solution to logical volumes and file systems, which makes it less complex to manage than Btrfs on top of md.
  • RAID-Z and RAID-Z2 are reliable and stable, unlike
    • Btrfs on md RAID-5/RAID-6, which only offers error detection on silently corrupted data blocks (plus silently corrupted parity blocks may go undetected until it's too late) and no easy way to do error correction, and
    • Btrfs RAID-5/RAID-6, which "has multiple serious data-loss bugs in it".
  • If I silently corrupted an entire disk with ZFS RAID-Z2, I would lose no data at all whereas on md RAID-6, I actually lost 455,681 inodes.
Deltik

I found a way to create a read error for mdadm.

With dmsetup you can create logical devices from tables.

Devices are created by loading a table that specifies a target for each sector (512 bytes)

From the dmsetup manpage.

In these tables you can specify offsets which should return an I/O error, for example:

0 4096 linear /dev/sdb 0
4096 1 error
4097 2093055 linear /dev/sdb 4097

This creates a 1 GiB device that returns an I/O error at sector 4096 (byte offset 4096 * 512).
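If it helps anyone, the table can be fed to dmsetup on stdin; the device name md_sdb_err below is arbitrary, and the resulting /dev/mapper/md_sdb_err would then be assembled into the array in place of /dev/sdb:

dmsetup create md_sdb_err <<'EOF'
0 4096 linear /dev/sdb 0
4096 1 error
4097 2093055 linear /dev/sdb 4097
EOF
# reads of sector 4096 on /dev/mapper/md_sdb_err now fail with an I/O error,
# so md sees a real read error there instead of silently wrong data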