I followed @Paul's (https://superuser.com/users/89018/paul) instructions from his answer to Shrink RAID by removing a disk?, but I think I may have made a terrible mistake. Here's the lowdown...
I've been upgrading the 4TB drives in my DS1813+ one by one with Seagate IronWolf 10TB drives. I had one drive left to upgrade, but rather than go through the day-plus process of rebuilding the array after swapping that last drive and then following Paul's process, I figured I'd simply remove the remaining 4TB drive from the array - I assumed I'd be able to fail it out during the shrink. Unfortunately, that wasn't the case, and I fear it may now be too late for my 22TB of data. Here is my PuTTY session:
ash-4.3# pvdisplay -C
PV VG Fmt Attr PSize PFree
/dev/md2 vg1 lvm2 a-- 25.44t 50.62g
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
27316073792 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3# exit
exit
Rob@Apophos-DS:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md0 2.3G 940M 1.3G 43% /
none 2.0G 4.0K 2.0G 1% /dev
/tmp 2.0G 656K 2.0G 1% /tmp
/run 2.0G 9.8M 2.0G 1% /run
/dev/shm 2.0G 4.0K 2.0G 1% /dev/shm
none 4.0K 0 4.0K 0% /sys/fs/cgroup
cgmfs 100K 0 100K 0% /run/cgmanager/fs
/dev/vg1/volume_3 493G 749M 492G 1% /volume3
/dev/vg1/volume_1 3.4T 2.3T 1.1T 69% /volume1
/dev/vg1/volume_2 22T 19T 2.4T 89% /volume2
Rob@Apophos-DS:~$ pvdisplay -C
WARNING: Running as a non-root user. Functionality may be unavailable.
/var/lock/lvm/P_global:aux: open failed: Permission denied
Unable to obtain global lock.
Rob@Apophos-DS:~$ sudo su
Password:
ash-4.3# pvdisplay -C
PV VG Fmt Attr PSize PFree
/dev/md2 vg1 lvm2 a-- 25.44t 50.62g
ash-4.3# mdadm --grow -n5 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: this change will reduce the size of the array.
use --grow --array-size first to truncate array.
e.g. mdadm --grow /dev/md2 --array-size 15609185024
ash-4.3# mdadm --grow /dev/md2 --array-size 15609185024
ash-4.3# pvdisplay -C
PV VG Fmt Attr PSize PFree
/dev/md2 vg1 lvm2 a-- 25.44t 50.62g
ash-4.3# mdadm --grow -n6 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 2240K of critical section..
mdadm: /dev/md2: Cannot grow - need backup-file
ash-4.3# mdadm --grow -n5 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 1792K of critical section..
mdadm: /dev/md2: Cannot grow - need backup-file
ash-4.3# mdadm --grow -n5 /dev/md2 --backup-file /root/mdadm.md0.backup
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 1792K of critical section..
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.0% (216708/3902296256) finish=3000.8min speed=21670K/sec
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.0% (693820/3902296256) finish=3230.3min speed=20129K/sec
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.0% (1130368/3902296256) finish=6500.6min speed=10001K/sec
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.0% (1442368/3902296256) finish=6667.7min speed=9750K/sec
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[13] sdh3[7] sdb3[9] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.4% (18826624/3902296256) finish=6706.8min speed=9650K/sec
md1 : active raid1 sdf2[5] sda2[1] sdb2[7] sdc2[2] sdd2[3] sde2[4] sdg2[6] sdh2[0]
2097088 blocks [8/8] [UUUUUUUU]
md0 : active raid1 sdf1[5] sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
ash-4.3#
Broadcast message from root@Apophos-DS
(unknown) at 22:16 ...
The system is going down for reboot NOW!
login as: Rob
Rob@192.168.81.181's password:
Could not chdir to home directory /var/services/homes/Rob: No such file or directory
Rob@Apophos-DS:/$ sudo su
Password:
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
2097088 blocks [8/8] [UUUUUUUU]
[=====>...............] resync = 26.8% (563584/2097088) finish=2.4min speed=10314K/sec
md2 : active raid5 sdh3[7] sdb3[9] sdf3[13] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 0.5% (19578240/3902296256) finish=10384.2min speed=6231K/sec
md0 : active raid1 sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdf1[5] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
Now, with the backstory and the readout from my PuTTY session, I'm hoping someone can tell me how to unscrew myself. I believe my problem - after starting the process without sufficient foresight, consideration, or a full understanding of the process itself - is twofold: I didn't fail the remaining 4TB drive beforehand, so mdadm was basing its calculations on the smallest drive size of 4TB (likely not taking into account the 70TB of space across the other seven drives), and possibly the mdadm --grow commands I issued with different -n values. Here are those commands again (a quick check for the first point is sketched just after them):
ash-4.3# mdadm --grow -n5 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: this change will reduce the size of the array.
use --grow --array-size first to truncate array.
e.g. mdadm --grow /dev/md2 --array-size 15609185024
ash-4.3# mdadm --grow /dev/md2 --array-size 15609185024
ash-4.3# pvdisplay -C
PV VG Fmt Attr PSize PFree
/dev/md2 vg1 lvm2 a-- 25.44t 50.62g
ash-4.3# mdadm --grow -n6 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 2240K of critical section..
mdadm: /dev/md2: Cannot grow - need backup-file
ash-4.3# mdadm --grow -n5 /dev/md2
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 1792K of critical section..
mdadm: /dev/md2: Cannot grow - need backup-file
ash-4.3# mdadm --grow -n5 /dev/md2 --backup-file /root/mdadm.md0.backup
mdadm: max_devs [384] of [/dev/md2]
mdadm: Need to backup 1792K of critical section..
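On the first point - mdadm sizing everything off the smallest member - I think examining the member superblocks would confirm it one way or the other. Something like the following is what I have in mind (assuming DSM's mdadm prints the same fields as stock mdadm, and that I'm reading them correctly):
# compare each member's available size with what the array is actually using
mdadm --examine /dev/sd[a-h]3 | grep -E '^/dev/sd|Avail Dev Size|Used Dev Size'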
Here is the current output from cat /proc/mdstat. I noticed that /dev/md2 shows only 5 Us compared to the 8 Us of the other md devices, and that scares me since they all sit on the same RAID group of 8 disks:
ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sdh2[7] sdg2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
2097088 blocks [8/8] [UUUUUUUU]
md2 : active raid5 sdh3[7] sdb3[9] sdf3[13] sdg3[6] sde3[12] sdd3[11] sdc3[10] sda3[8]
15609185024 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
[>....................] reshape = 1.2% (48599680/3902296256) finish=6495.2min speed=9888K/sec
md0 : active raid1 sda1[1] sdb1[7] sdc1[2] sdd1[3] sde1[4] sdf1[5] sdg1[6] sdh1[0]
2490176 blocks [8/8] [UUUUUUUU]
unused devices: <none>
At the very least I need to be able to save /dev/vg1/volume_1. I'm hoping that since I didn't touch that volume it will still be salvageable, but at this point I don't know, since all three volumes are listed as "Crashed" in DSM. I'm hoping (but not hopeful) that once the consistency check completes everything will just be okay.
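If it helps in answering, my (possibly naive) thinking is that if all of volume_1's extents sit below the truncated ~16TB mark on the PV, that LV may still be intact. Something along these lines should show the segment-to-extent mapping, assuming DSM's cut-down LVM supports these options:
# map each LV's segments to physical extent ranges on /dev/md2 (vg1 is my volume group)
lvs -o lv_name,seg_start_pe,seg_size_pe,seg_pe_ranges,devices vg1
# or the long-form allocation view of the PV itself
pvdisplay --maps /dev/md2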
Anyone who knows mdadm - I'm in desperate need of your help! Paul, if you're out there, I need your assistance! I know I screwed up and there's a good chance I've lost everything, but if there's anything you can suggest that has any chance of saving my bacon, please help!
Update (12/5/17): No change, except that the reshaping continues to progress - it's up to 17.77% now. DSM still shows all volumes as "Crashed (Checking parity consistency 17.77%)", while the disk group says "Verifying hard disks in the background (Checking parity consistency 17.77%)". Here's an image of the disk group:
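In the meantime I've mostly just been watching the reshape crawl along and wondering whether I can safely nudge the speed up, along these lines - I'm not sure whether DSM overrides these knobs, so treat this as a guess on my part:
# progress in sectors (done / total) straight from the md sysfs
cat /sys/block/md2/md/sync_completed
# raise the kernel's minimum resync/reshape speed (KB/s); the usual default is 1000
echo 50000 > /proc/sys/dev/raid/speed_limit_min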
I believe the critical step I missed was either running mdadm /dev/md2 --fail /dev/sdf3 --remove /dev/sdf3 or manually pulling the drive - either of which would have failed the remaining 4TB drive and removed it from the array, leaving me with a degraded RAID 5 of 7 x 10TB drives. My question now is: should I wait until the array has finished reshaping before removing the 4TB drive, or should I fail/remove it now? My spidey-sense says removing a drive during a rebuild/reshape will turn out poorly, since that's what I've always been taught, but I don't know whether that necessarily holds here, where mdadm is trying to cram 7 drives' worth of data into 5 drives based solely on the size of the remaining 4TB drive.
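For what it's worth, here is the order I now think I should have followed in the first place. I'm reconstructing this from my reading of Paul's answer plus the errors mdadm threw at me above, so treat it as a sketch rather than something I've verified - I'm not even sure mdadm will reshape while degraded, which is partly why I'm asking:
# 0) shrink the filesystem(s), LV(s) and PV first so everything fits under the new array size
#    (resize2fs / lvreduce / pvresize - sizes omitted, since I never got this far)
# 1) fail and remove the last 4TB member so md stops sizing the array off it
mdadm /dev/md2 --fail /dev/sdf3
mdadm /dev/md2 --remove /dev/sdf3
# 2) truncate to 6 data members' worth (6 x 3902296256 = 23413777536 blocks)
mdadm --grow /dev/md2 --array-size 23413777536
# 3) reshape 8 -> 7 devices, with a backup file this time (file name is just an example)
mdadm --grow -n7 /dev/md2 --backup-file=/root/mdadm.md2.backup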
Also, in case it's helpful here is the output from mdadm -D /dev/md2:
/dev/md2:
Version : 1.2
Creation Time : Wed Mar 5 22:45:07 2014
Raid Level : raid5
Array Size : 15609185024 (14886.08 GiB 15983.81 GB)
Used Dev Size : 3902296256 (3721.52 GiB 3995.95 GB)
Raid Devices : 5
Total Devices : 8
Persistence : Superblock is persistent
Update Time : Tue Dec 5 17:46:27 2017
State : clean, recovering
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Reshape Status : 18% complete
Delta Devices : -3, (5->2)
Name : DS:2 (local to host DS)
UUID : UUID
Events : 153828
Number Major Minor RaidDevice State
7 8 115 0 active sync /dev/sdh3
8 8 3 1 active sync /dev/sda3
10 8 35 2 active sync /dev/sdc3
11 8 51 3 active sync /dev/sdd3
12 8 67 4 active sync /dev/sde3
6 8 99 5 active sync /dev/sdg3
9 8 19 7 active sync /dev/sdb3
13 8 83 6 active sync /dev/sdf3
What worries me about this is that the Array Size is now listed as roughly 16TB, when the total amount of data on the array was over 20TB. I'm not sure what I should do at this point. Any thoughts or experience would be greatly appreciated!
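One more data point, in case it confirms the diagnosis: if I'm doing the math right, that Array Size is exactly four data members times the current per-member size - in other words a 5-disk RAID 5 at 4TB per member, which matches the -n5 reshape that's running:
# sanity-checking the numbers from mdadm -D above (sizes are in 1K blocks)
echo $(( 3902296256 * 4 ))            # = 15609185024, the new Array Size
echo $(( 15609185024 / 1024 / 1024 )) # ~= 14886 GiB, matching the "14886.08 GiB" above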
