
I have a 9TB XFS partition consisting of four 3TB disks in a RAID-5 array with a chunk size of 256KB, using mdadm.

When I created the partition, the optimal stripe unit and width values (64 and 192 blocks) were detected and set automatically, which xfs_info confirms:

# xfs_info /dev/md3
meta-data=/dev/md3               isize=256    agcount=32, agsize=68675072 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2197600704, imaxpct=5
         =                       sunit=64     swidth=192 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
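
For completeness, the array geometry those numbers come from (sunit = the 256KB chunk, swidth = 3 data disks x 256KB = 768KB) can be read straight from mdadm; the grep is just to trim the output:

# mdadm --detail /dev/md3 | grep -E 'Raid Level|Raid Devices|Chunk Size'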

However, I was experiencing slow transfer speeds, and while investigating I noticed that unless I specifically mount the partition with -o sunit=64,swidth=192, the stripe unit is always set to 512 and the stripe width to 1536. For instance:

# umount /dev/md3
# mount -t xfs -o rw,inode64 /dev/md3 /data
# grep xfs /proc/mounts
/dev/md3 /data xfs rw,relatime,attr2,delaylog,inode64,logbsize=256k,sunit=512,swidth=1536,noquota 0 0

Is this intended behavior? I suppose that I could just start mounting it with sunit=64,swidth=192 every time, but wouldn't that make the current data (which was written while mounted with sunit=512,swidth=1536) misaligned?

The operating system is Debian Wheezy with kernel 3.2.51. All four hard disks are Advanced Format disks (smartctl says 512 bytes logical, 4096 bytes physical). The fact that the values are multiplied by 8 makes me wonder whether this has anything to do with the issue, since it matches the ratio between 512-byte and 4096-byte sectors.
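
For reference, blockdev reports the same sector sizes (shown for one member disk here; the actual device names differ of course):

# blockdev --getss --getpbsz /dev/sda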

Can anyone shed some light on this? :-)

Sauron

1 Answer


Your mystery multiply-by-8 is because xfs_info shows sunit/swidth in bsize blocks, typically 4096 bytes, whereas sunit/swidth given to mount with -o (or in fstab) are specified in 512-byte units. Note the "blks" string after the sunit/swidth numbers in your xfs_info output. 4096/512 = 8, hence the mystery multiplier.
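
A quick sanity check with the numbers from your output (plain shell arithmetic, nothing XFS-specific): both pairs come out to the same byte counts, 262144 (256KB) for the stripe unit and 786432 (768KB) for the stripe width.

# echo $((64 * 4096)) $((512 * 512))
# echo $((192 * 4096)) $((1536 * 512))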

man 5 xfs spells this out in its sunit stanza, as does the mkfs.xfs man page, regarding the 512-byte units.

man xfs_growfs, which doubles as the man page for xfs_info, spells out that xfs_info reports these values in bsize blocks.

Confusing, yes. Very bad design choice from a UI perspective, yes.

Specifying "-o sunit=64,swidth=192" was probably a bad idea, as mount takes those numbers in 512-byte units: what you really wanted was 64*8=512 and 192*8=1536, which is exactly what the kernel was already defaulting to from the superblock. If anything, you may have "hardcoded" the 8-times-smaller values into the FS by mounting with the smaller numbers. The man page is pretty explicit about never being able to switch to a lower sunit. However, you could probably try, and see if you get mount errors. Mount for XFS should (but no guarantees) be robust enough to not eat your data: it should just spit out an error and refuse to mount, or mount with sane options ignoring what you specify. Make backups first.
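
If you want to be explicit about it, the values in mount's 512-byte units look like this (same device and mountpoint as in your question); xfs_db can also show what is actually recorded in the superblock, in filesystem blocks like xfs_info, so you can confirm nothing was changed on disk. A sketch only; take that backup first:

# xfs_db -r -c 'sb 0' -c 'p unit' -c 'p width' /dev/md3
# umount /data
# mount -t xfs -o rw,inode64,sunit=512,swidth=1536 /dev/md3 /data
# grep xfs /proc/mounts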

That said, the data you have written so far is fine: sunit=512,swidth=1536 in 512-byte units is exactly the same 256KB/768KB geometry the filesystem was created with, just displayed in different units, so nothing is misaligned. Even a genuinely 8-times-greater sunit/swidth would still be aligned, as this is all about alignment and multiples of the real stripe unit remain aligned; perhaps there could be fragmentation issues, or some waste, if most of your files are tiny.
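
If you want to check the fragmentation angle, xfs_db has a read-only frag command that reports an overall fragmentation factor (safe to run on a mounted filesystem):

# xfs_db -r -c frag /dev/md3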

Aside: What I am working on now and finding intriguing is what to change sunit/swidth values to when you grow/reshape your md RAID by adding 1 disk. From the man page it appears you cannot change sunit unless you literally double the number of disks, but it seems changing swidth is still possible. Whether this results in proper alignment in most cases remains to be seen. Information from people actually doing this seems scarce.
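
To make the aside concrete, the reshape and remount would look roughly like this (purely a sketch, untested, with a made-up fifth disk /dev/sde1). After the reshape a 5-disk RAID-5 has 4 data disks, so in mount's 512-byte units swidth would become 4*512=2048 while sunit stays 512; you would wait for the reshape to finish before remounting and growing:

# mdadm /dev/md3 --add /dev/sde1
# mdadm --grow /dev/md3 --raid-devices=5
# umount /data
# mount -t xfs -o rw,inode64,sunit=512,swidth=2048 /dev/md3 /data
# xfs_growfs /data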