I have a ZFS RAIDZ2 array in my main server and it recently suffered a drive failure. Naturally I was woefully underprepared for this, so I rushed out, bought a replacement drive and got cracking with learning how to replace a drive safely.
Background
If you don’t already have one, you need a ZFS array to play with. At least the first part of this article is based around the test array I built in a previous article.
Replacing a Disk
Replacing a disk is quite simple if you have a spare space in your system for a replacement drive; it’s slightly more involved if there are no spare spaces.
Start by using zpool status
to show which devices are present in your pool. This is important: you don’t want to accidentally work with the wrong device, and that’s easily done when the devices are matching models.
zpool status
  pool: tank
 state: ONLINE
config:

        NAME                                           STATE     READ WRITE CKSUM
        tank                                           ONLINE       0     0     0
          raidz1-0                                     ONLINE       0     0     0
            ata-ST3160812AS_5LS9ADML                   ONLINE       0     0     0
            ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860  ONLINE       0     0     0

errors: No known data errors
Since you added devices by ID you’ll get the whole ID of the device as the name.
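If you ever need to map those ID-based names back to the /dev/sdX nodes that tools like fdisk report, zpool status can print fuller paths. A small sketch, assuming the pool is called tank as in this article:

# -P prints full device paths; -L resolves the by-id symlinks to the
# underlying block device names (e.g. /dev/sdb), which is handy for
# cross-checking against fdisk output.
zpool status -P tank
zpool status -L tank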
Add the new drive to the system and then run fdisk
to find all the known devices:
fdisk -l
... snip ...
Disk /dev/sdd: 149.01 GiB, 160000000000 bytes, 312500000 sectors
Disk model: ST3160812AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 2F90C290-3476-EE4C-9647-E00D10BF4DD1

Device         Start       End   Sectors  Size Type
/dev/sdd1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sdd9  312481792 312498175     16384    8M Solaris reserved 1
I’ve snipped out the devices that were present in the base system, which is detailed in the previous article, and only show the newly added device here. The new device is another 160GB drive, exactly the same model as the one already in the pool. This new device already has a filesystem on it so it needs wiping; the filesystem present is an old ZFS pool I created when I was last learning this. List devices by ID to find the ID of the new drive. Take care here to pick the correct device, and triple-check that the device you are working with is not in the existing pool!
ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 Jul 29 09:43 ata-ST2000DL003-9VT166_5YD1KN65 -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 29 09:43 ata-ST2000DL003-9VT166_5YD1KN65-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 29 09:43 ata-ST2000DL003-9VT166_5YD1KN65-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jul 29 09:43 ata-ST2000DL003-9VT166_5YD1KN65-part3 -> ../../sda3
lrwxrwxrwx 1 root root  9 Jul 29 12:54 ata-ST3160812AS_5LS92RAL -> ../../sdd
lrwxrwxrwx 1 root root 10 Jul 29 12:54 ata-ST3160812AS_5LS92RAL-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jul 29 12:54 ata-ST3160812AS_5LS92RAL-part9 -> ../../sdd9
lrwxrwxrwx 1 root root  9 Jul 29 11:40 ata-ST3160812AS_5LS9ADML -> ../../sdb
lrwxrwxrwx 1 root root 10 Jul 29 11:40 ata-ST3160812AS_5LS9ADML-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jul 29 11:40 ata-ST3160812AS_5LS9ADML-part9 -> ../../sdb9
lrwxrwxrwx 1 root root  9 Jul 29 11:40 ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jul 29 11:40 ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Jul 29 11:40 ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860-part9 -> ../../sdc9
... snip ...
The newly added device is “ata-ST3160812AS_5LS92RAL”. I can tell this because I know that “ata-ST2000DL003-9VT166_5YD1KN65” is my OS drive and “ata-ST3160812AS_5LS9ADML” and “ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860” are listed in the pool output.
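To back up the eyeball check, something like this minimal sketch (assuming the pool is called tank) prints every whole-disk by-id name that does not appear in the pool’s status output; the new drive should be in that list and nothing from the pool should be:

# List ata-* by-id device names that are not mentioned in zpool status.
for dev in /dev/disk/by-id/ata-*; do
    name=$(basename "$dev")
    case "$name" in *-part*) continue ;; esac    # skip partition symlinks
    zpool status tank | grep -q "$name" || echo "not in pool: $name"
done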
Issue a wipe command on the new device:
wipefs -a /dev/disk/by-id/ata-ST3160812AS_5LS92RAL
/dev/disk/by-id/ata-ST3160812AS_5LS92RAL: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/disk/by-id/ata-ST3160812AS_5LS92RAL: 8 bytes were erased at offset 0x2540be3e00 (gpt): 45 46 49 20 50 41 52 54
/dev/disk/by-id/ata-ST3160812AS_5LS92RAL: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/disk/by-id/ata-ST3160812AS_5LS92RAL: calling ioctl to re-read partition table: Success
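Since the leftover filesystem here was itself an old ZFS pool, zpool labelclear is an alternative worth knowing about: it clears the old ZFS label itself, though unlike wipefs it leaves the partition table alone. Just a sketch, and note that for a whole-disk pool the old label usually lives on the first partition:

# Remove ZFS label information from a device that is not part of an imported pool.
zpool labelclear -f /dev/disk/by-id/ata-ST3160812AS_5LS92RAL-part1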
Running fdisk -l
will now show no filesystem on the device. More importantly, use zpool status
to check that the pool is still healthy; the output should exactly match what is shown above.
Now issue a replace command to swap the existing 320GB Western Digital drive for the new drive. The new drive is smaller than the one it’s replacing, which usually isn’t allowed, but it works here because the array was originally limited to 160GB per disk, having been built with mismatched drives.
zpool replace tank /dev/disk/by-id/ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860 /dev/disk/by-id/ata-ST3160812AS_5LS92RAL
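If the outgoing drive has actually died or has already been pulled from the machine, the replace works the same way; as a sketch, the old device can be referred to by the name (or the GUID from zpool status -g) that the pool still records for it:

# Optionally mark the failed drive offline first, then replace it by the name
# the pool knows it by, pointing at the new disk's by-id path.
zpool offline tank ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860
zpool replace tank ata-WDC_WD3200AAKX-001CA0_WD-WCAYUL400860 /dev/disk/by-id/ata-ST3160812AS_5LS92RAL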
The command returns nothing if it succeeds. Running zpool status
will now show the pool made up of the two Seagate drives.
zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 1.24M in 00:00:01 with 0 errors on Mon Jul 29 14:11:08 2024
config:

        NAME                          STATE     READ WRITE CKSUM
        tank                          ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            ata-ST3160812AS_5LS9ADML  ONLINE       0     0     0
            ata-ST3160812AS_5LS92RAL  ONLINE       0     0     0

errors: No known data errors
Notice that it reports the drive was resilvered and that it took just one second to do so; in a real system resilvering will usually take much longer, but this pool only has a tiny amount of data on it. The Western Digital 320GB drive can now be removed from the system. Running zpool status
again after removal should show the pool as still healthy. Obviously the removed drive will no longer appear in fdisk etc.
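As a belt-and-braces check once the replacement has settled, a scrub re-reads and verifies every block in the pool; this is my own habit rather than anything the replacement requires:

# Verify the whole pool after the swap; progress is reported by zpool status.
zpool scrub tank
zpool status tank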
Running the disk replacement on my main server gives output like this while the resilvering is taking place:
zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul 29 14:56:44 2024
        237G / 38.1T scanned at 19.7G/s, 0B / 38.1T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                                     STATE     READ WRITE CKSUM
        tank                                     ONLINE       0     0     0
          raidz2-0                               ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2PSV35    ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2ATVC1    ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2PSG9B    ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL22R6W2    ONLINE       0     0     0
            replacing-4                          ONLINE       0     0     0
              ata-ST16000NM001G-2KK103_ZL21R7LK  ONLINE       0     0     0
              ata-ST16000NM001G-2KK103_ZL2EPHQB  ONLINE       0     0     0

errors: No known data errors
With 38T to resilver it’s going to take rather more than the one second on the test system (probably about 24 hours). The replacing-4 entry is a temporary vdev that ZFS creates during a replacement: it holds both the outgoing and incoming devices while the resilver runs, and the number is the position (counting from zero) of the device being replaced within the raidz2 vdev, so here it refers to the fifth drive.
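While waiting, the scan line is the one to watch. Something like the following (just a convenience, assuming watch is installed) re-runs the status command every minute:

# Refresh the pool status every 60 seconds until the resilver completes.
watch -n 60 zpool status tank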
Further Thoughts
While resilvering the array I got to thinking about how the array functions when it has a working disk that is being replaced. Asking on Reddit, it seems it’s better if you can leave the failing disk in the array while it’s resilvering; apparently it makes the resilver faster and can make it safer. See here. This also means it’s always a good idea to have a spare space free for a replacement disk to go in.