/dev/ada1 failing
smartd sent me not one, but two emails today, regarding /dev/ada1.
From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:34 +0200 (CEST)
Subject: SMART error (Health) detected on host: enterprise
This message was generated by the smartd daemon running on:
host name: enterprise
DNS domain: ximalas.info
The following warning/error was logged by the smartd daemon:
Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:37 +0200 (CEST)
Subject: SMART error (Usage) detected on host: enterprise
This message was generated by the smartd daemon running on:
host name: enterprise
DNS domain: ximalas.info
The following warning/error was logged by the smartd daemon:
Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
smartctl has this to say about /dev/ada1 and its S.M.A.R.T. attributes:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   006    Pre-fail  Always       -       3136808
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   035   035   036    Pre-fail  Always   FAILING_NOW 21480
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       601807587
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   053   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   034   028   000    Old_age   Always       -       3136808
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+00m+17.596s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2621705511
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2832710658
The 4 CRC errors suggest that something could be wrong with the SATA cable. The raw read error rate is also suspicious: it is more than twice that of /dev/ada0.
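To make the FAILING_NOW flag concrete: smartctl marks an attribute as failed once its normalized VALUE is less than or equal to THRESH. The Python sketch below applies that rule to three rows copied from the table above. The names SmartAttr, parse_attr and is_failing are made up for this illustration and are not part of smartmontools.

```python
from dataclasses import dataclass


@dataclass
class SmartAttr:
    attr_id: int
    name: str
    value: int   # normalized value, higher is healthier
    worst: int
    thresh: int
    raw: str     # vendor-specific raw value, kept as text


def parse_attr(line: str) -> SmartAttr:
    """Parse one whitespace-separated row of the attribute table."""
    f = line.split()
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     " ".join(f[9:]))


def is_failing(attr: SmartAttr) -> bool:
    """An attribute has failed when VALUE <= THRESH."""
    return attr.value <= attr.thresh


# Rows copied from the /dev/ada1 table above.
ROWS = [
    "1 Raw_Read_Error_Rate 0x000f 100 099 006 Pre-fail Always - 3136808",
    "5 Reallocated_Sector_Ct 0x0033 035 035 036 Pre-fail Always FAILING_NOW 21480",
    "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4",
]

attrs = [parse_attr(row) for row in ROWS]
failing = [a.name for a in attrs if is_failing(a)]
print(failing)
```

Only Reallocated_Sector_Ct trips the rule here (35 against a threshold of 36), which matches the second email from smartd.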
/dev/ada0 has been /dev/ada1’s partner for the last 3 years or so, and here are the former’s S.M.A.R.T. attributes:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   006    Pre-fail  Always       -       1502664
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       623196340
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   053   045    Old_age   Always       -       32 (Min/Max 31/37)
194 Temperature_Celsius     0x0022   032   047   000    Old_age   Always       -       32 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       1502664
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+59m+05.142s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       412698749
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3839349951
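For a side-by-side view of where the two mirror partners differ, here is a small Python sketch that compares a few raw values copied from the two tables. Raw values are vendor-specific, so the ratio is only a rough hint and says nothing definite about actual error counts. The function name compare is hypothetical.

```python
# Raw values copied from the two tables above, as (ada1, ada0) pairs.
RAW = {
    "Raw_Read_Error_Rate": (3136808, 1502664),
    "Reallocated_Sector_Ct": (21480, 0),
    "UDMA_CRC_Error_Count": (4, 0),
}


def compare(raw):
    """Return {attribute: (difference, ratio or None)} for ada1 versus ada0."""
    result = {}
    for name, (ada1, ada0) in raw.items():
        # A ratio is undefined when the healthy drive reports zero.
        ratio = ada1 / ada0 if ada0 else None
        result[name] = (ada1 - ada0, ratio)
    return result


report = compare(RAW)
for name, (diff, ratio) in report.items():
    ratio_text = f"{ratio:.2f}x" if ratio is not None else "n/a"
    print(f"{name}: +{diff} ({ratio_text})")
```

The reallocated sector count is the difference that matters most; /dev/ada0 has none at all.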
ZFS is still happy as ever:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
scan: scrub repaired 0 in 1h52m with 0 errors on Sun Apr 12 05:40:43 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
/dev/ada1 is partitioned like this:
root@enterprise:~>gpart show -l ada1
=> 34 976773101 ada1 GPT (465G)
34 6 - free - (3.0k)
40 256 1 gptboot1 (128k)
296 1752 - free - (876k)
2048 33554432 2 swap1 (16G)
33556480 943216648 3 enterprise_zroot1 (449G)
976773128 7 - free - (3.5k)
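As a sanity check on the layout before backing it up, the following Python sketch verifies that the regions listed by gpart are contiguous and converts sector counts to sizes. It assumes 512-byte sectors, which is consistent with the sizes gpart prints but is not stated in the output itself. The helper names are made up.

```python
SECTOR_BYTES = 512  # assumption: gpart counts 512-byte logical sectors

# (start, size, label) rows copied from the gpart output above.
LAYOUT = [
    (34, 6, "free"),
    (40, 256, "gptboot1"),
    (296, 1752, "free"),
    (2048, 33554432, "swap1"),
    (33556480, 943216648, "enterprise_zroot1"),
    (976773128, 7, "free"),
]
FIRST, TOTAL = 34, 976773101  # from the "=>" line


def is_contiguous(layout, first, total):
    """True if every region starts where the previous one ended."""
    expected = first
    for start, size, _label in layout:
        if start != expected:
            return False
        expected = start + size
    return expected == first + total


def gib(sectors):
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30


print(is_contiguous(LAYOUT, FIRST, TOTAL))
for start, size, label in LAYOUT:
    print(f"{label:>18}: {gib(size):10.4f} GiB")
```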
For completeness, here are the S.M.A.R.T. attributes for the three remaining hard drives:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       317864
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       684637847
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28478
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   056   045    Old_age   Always       -       30 (Min/Max 30/35)
194 Temperature_Celsius     0x0022   030   044   000    Old_age   Always       -       30 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   051   030   000    Old_age   Always       -       317864
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28477h+42m+14.509s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1111675742
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1742318989
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       56200
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       666707923
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   051   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   049   000    Old_age   Always       -       31 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   059   033   000    Old_age   Always       -       56200
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28474h+55m+46.515s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2559712939
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1733063492
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       511904
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       678977376
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28477
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   071   053   045    Old_age   Always       -       29 (Min/Max 28/33)
194 Temperature_Celsius     0x0022   029   047   000    Old_age   Always       -       29 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   055   030   000    Old_age   Always       -       511904
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28476h+57m+17.472s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4252139494
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       398472105
I promptly ordered 6 new ST500DM002 hard drives, and I expect them to arrive within the week. Hopefully, the system will be able to keep itself afloat until I can replace the wonky hard drive.
Update 2015-04-20
The failing /dev/ada1 drive was replaced today.
Before replacing the drive, I took a backup of the GPT using gpart backup ada1 > /root/gpart.ada1.txt.
After replacing the drive:
- I booted into single user mode,
- ran gpart restore -l ada1 < /root/gpart.ada1.txt,
- ran gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1,
- rebooted to multi user mode,
- ran zpool online enterprise_zroot ada1p3 as suggested by the zpool status command, and
- realised zpool replace enterprise_zroot ada1p3 is really the way to go.
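The steps above can be written down as a dry-run checklist. This Python sketch only builds and prints the command strings; it does not execute anything. It skips the zpool online detour and goes straight to zpool replace, which is the conclusion I reached above. The function name replacement_commands and its parameters are hypothetical, and the partition indexes (1 for the boot partition, 3 for the ZFS partition) are taken from the gpart output for this particular machine.

```python
def replacement_commands(disk, pool, zfs_index=3, boot_index=1,
                         backup="/root/gpart.ada1.txt"):
    """Return (phase, command) pairs for swapping one mirror member.

    Nothing is executed; the strings are meant to be read and typed by hand.
    """
    member = f"{disk}p{zfs_index}"
    return [
        # While the old drive is still installed:
        ("before swap", f"gpart backup {disk} > {backup}"),
        # After installing the new drive and booting single user:
        ("single user", f"gpart restore -l {disk} < {backup}"),
        ("single user",
         f"gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot "
         f"-i {boot_index} {disk}"),
        # After rebooting to multi user mode:
        ("multi user", f"zpool replace {pool} {member}"),
    ]


steps = replacement_commands("ada1", "enterprise_zroot")
for phase, command in steps:
    print(f"[{phase}] {command}")
```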
I assume the zpool online command is responsible for the 744 checksum errors listed below.
ZFS began resilvering the ada1p3 partition:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Apr 20 17:12:37 2015
11,0G scanned out of 110G at 9,11M/s, 3h4m to go
11,0G resilvered, 10,04% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 UNAVAIL 0 0 0
776201329010632765 UNAVAIL 0 0 0 was /dev/ada1p3/old
ada1p3 ONLINE 0 0 744 (resilvering)
errors: No known data errors
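To spot a non-zero counter such as the 744 checksum errors without reading every row, a short Python sketch can scan the config section of the output above. It assumes the counters are plain integers; rows with anything else in those columns (the header, or counts that zpool may abbreviate) are skipped. The function name devices_with_errors is made up.

```python
# Config rows copied from the zpool status output above.
STATUS = """\
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 UNAVAIL 0 0 0
776201329010632765 UNAVAIL 0 0 0 was /dev/ada1p3/old
ada1p3 ONLINE 0 0 744 (resilvering)
"""


def devices_with_errors(config_text):
    """Map device name to (read, write, cksum) for rows with any non-zero counter."""
    found = {}
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        counters = fields[2:5]
        if not all(c.isdigit() for c in counters):
            continue  # header row, or counters that are not plain integers
        read, write, cksum = map(int, counters)
        if read or write or cksum:
            found[fields[0]] = (read, write, cksum)
    return found


errors = devices_with_errors(STATUS)
print(errors)
```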
Just sit back and enjoy the ride of your life … :P
Here are the last three outputs from the zpool status command:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Apr 20 17:12:37 2015
109G scanned out of 110G at 15,1M/s, 0h0m to go
109G resilvered, 99,81% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 UNAVAIL 0 0 0
776201329010632765 UNAVAIL 0 0 0 was /dev/ada1p3/old
ada1p3 ONLINE 0 0 744 (resilvering)
errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Apr 20 17:12:37 2015
110G scanned out of 110G at 15,1M/s, (scan is slow, no estimated time)
110G resilvered, 100,11% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
ada0p3 ONLINE 0 0 0
replacing-1 UNAVAIL 0 0 0
776201329010632765 UNAVAIL 0 0 0 was /dev/ada1p3/old
ada1p3 ONLINE 0 0 744 (resilvering)
errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 744
errors: No known data errors
A subsequent zpool clear reset all counters:
root@enterprise:~>zpool clear enterprise_zroot
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
Resilvering took about 02:03:59.
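The figure comes from subtracting the start timestamp in the "resilver in progress since" line from the completion timestamp in the final scan line. A minimal Python sketch of that arithmetic, assuming both timestamps are in the same time zone and that the default C locale is used for the day and month names:

```python
from datetime import datetime

FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Mon Apr 20 17:12:37 2015"


def elapsed(start, end):
    """Return the time between two zpool status timestamps."""
    return datetime.strptime(end, FORMAT) - datetime.strptime(start, FORMAT)


duration = elapsed("Mon Apr 20 17:12:37 2015", "Mon Apr 20 19:16:36 2015")
print(duration)
```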
The new /dev/ada1 drive logged one CRC error, which suggests a problem with the SATA cable or possibly the SATA connector on the motherboard:
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 25 bf 40 1e 00 00 01 00 00
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): Retrying command
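If more of these turn up, counting them per device makes a flaky cable easier to spot. The Python sketch below counts "Uncorrectable parity/CRC error" lines per device. The regular expression is inferred from the three log lines above, so treat the pattern as an assumption about the message format rather than a specification. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error"
CRC = re.compile(
    r"\((\w+):\w+:\d+:\d+:\d+\): CAM status: Uncorrectable parity/CRC error"
)

# Log lines copied from the excerpt above.
LOG = [
    "Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): "
    "WRITE_FPDMA_QUEUED. ACB: 61 00 a0 25 bf 40 1e 00 00 01 00 00",
    "Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): "
    "CAM status: Uncorrectable parity/CRC error",
    "Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): "
    "Retrying command",
]


def crc_errors_by_device(lines):
    """Count CRC error status lines per device name."""
    counts = Counter()
    for line in lines:
        match = CRC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


crc_counts = crc_errors_by_device(LOG)
print(dict(crc_counts))
```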
Here are the S.M.A.R.T. attributes of the new /dev/ada1 drive:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       248
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       829317
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   054   045    Old_age   Always       -       33 (Min/Max 25/33)
194 Temperature_Celsius     0x0022   033   046   000    Old_age   Always       -       33 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       248
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+15m+19.425s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       225023812
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       62700
Next, I ran short, conveyance, and long S.M.A.R.T. tests on both /dev/ada0 and /dev/ada1. The results are shown below.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28650         -
# 2  Conveyance offline  Completed without error       00%     28649         -
# 3  Short offline       Completed without error       00%     28648         -
# 4  Extended offline    Completed without error       00%     20635         -
# 5  Short offline       Completed without error       00%     20633         -
# 6  Extended offline    Completed without error       00%      3721         -
# 7  Short offline       Completed without error       00%      3720         -
# 8  Extended offline    Completed without error       00%      3363         -
# 9  Short offline       Completed without error       00%      3361         -
#10  Extended offline    Completed without error       00%      3000         -
#11  Short offline       Completed without error       00%      2999         -
#12  Extended offline    Completed without error       00%      1451         -
#13  Extended offline    Completed without error       00%      1311         -
#14  Extended offline    Completed without error       00%       613         -
#15  Extended offline    Completed without error       00%       550         -
#16  Short offline       Completed without error       00%       548         -
#17  Extended offline    Completed without error       00%       259         -
#18  Conveyance offline  Completed without error       00%       134         -
#19  Extended offline    Completed without error       00%        14         -
#20  Short offline       Completed without error       00%        12         -
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         4         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -
Update 2015-04-21
I ran a scrub on the enterprise_zroot pool this morning, and all is OK:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
In the afternoon I replaced the /dev/ada0 drive, and the system booted automatically from the /dev/ada1 drive. I selected single user mode from the boot loader's menu, restored the GPT partition table in essentially the same way as I did yesterday, and installed the necessary boot blocks. I rebooted the system and let it complete booting to multi user mode.
ZFS complained about the missing ada0p3 member, so I issued the zpool replace command, and ZFS responded by resilvering the new /dev/ada0 drive:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://illumos.org/msg/ZFS-8000-2Q
scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3
ada1p3 ONLINE 0 0 0
errors: No known data errors
root@enterprise:~>zpool replace enterprise_zroot ada0p3
Make sure to wait until resilver is done before rebooting.
If you boot from pool 'enterprise_zroot', you may need to update
boot code on newly attached disk 'ada0p3'.
Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
34,1M scanned out of 97,4G at 3,79M/s, 7h18m to go
33,7M resilvered, 0,03% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
As you might have noticed, the amount of data to be resilvered today is less than on the previous afternoon. I decided to wipe clean /usr/obj and /usr/obj-10 prior to replacing the /dev/ada0 drive, and thus gained some gigabytes of free space.
For some reason, today's resilvering is slower than yesterday's:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
4,53G scanned out of 97,4G at 2,49M/s, 10h36m to go
4,53G resilvered, 4,65% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
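The "to go" estimate is simply the remaining data divided by the current rate. The Python sketch below reproduces it from the progress line, handling the decimal commas in this output. It assumes that G and M are binary units (GiB and MiB). zpool presumably computes its estimate from exact internal counters, so agreement with the rounded display values is not guaranteed in general. The helper names are made up.

```python
import re

UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
SCAN = re.compile(r"(\S+) scanned out of (\S+) at (\S+)/s")


def to_bytes(text):
    """Convert e.g. '4,53G' (decimal comma, unit suffix) to a byte count."""
    number, unit = text[:-1], text[-1]
    return float(number.replace(",", ".")) * UNITS[unit]


def eta_seconds(scan_line):
    """Estimate the remaining seconds from a resilver progress line."""
    match = SCAN.search(scan_line)
    if match is None:
        raise ValueError("not a resilver progress line")
    done, total, rate = (to_bytes(group) for group in match.groups())
    return (total - done) / rate


def as_hm(seconds):
    """Split seconds into whole hours and minutes."""
    hours, rest = divmod(int(seconds), 3600)
    return hours, rest // 60


line = "4,53G scanned out of 97,4G at 2,49M/s, 10h36m to go"
print(as_hm(eta_seconds(line)))
```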
Maybe the /dev/ada1 SATA cable is wonky. Or maybe I spoke too soon. After I yelled a bit, the pace suddenly increased:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
16,1G scanned out of 97,4G at 7,16M/s, 3h13m to go
16,1G resilvered, 16,52% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
Resilvering is now more than a third of the way through:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
35,7G scanned out of 97,4G at 10,3M/s, 1h42m to go
35,7G resilvered, 36,68% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
Aaaand, we're halfway through the resilvering process:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
49,4G scanned out of 97,4G at 11,2M/s, 1h12m to go
49,4G resilvered, 50,74% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
Less than one quarter remains:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
73,5G scanned out of 97,4G at 13,1M/s, 0h31m to go
73,5G resilvered, 75,47% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
Here are the last three outputs from the zpool status command:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
97,3G scanned out of 97,4G at 15,3M/s, 0h0m to go
97,2G resilvered, 99,88% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 21 16:22:28 2015
97,4G scanned out of 97,4G at 15,3M/s, (scan is slow, no estimated time)
97,4G resilvered, 100,02% done
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
replacing-0 UNAVAIL 0 0 0
14125420615616302625 UNAVAIL 0 0 0 was /dev/ada0p3/old
ada0p3 ONLINE 0 0 0 (resilvering)
ada1p3 ONLINE 0 0 0
errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
scan: resilvered 97,5G in 1h48m with 0 errors on Tue Apr 21 18:11:03 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors
This time the resilvering lasted for 01:48:35, which matches the 1h48m reported by zpool status. There's no reason to run the zpool clear command.
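As a cross-check, the average throughput of both resilvers can be derived from the start and end timestamps that zpool status reports, together with the amount of data resilvered. This Python sketch assumes G means GiB and that the default C locale is used for parsing the day and month names. The function name mib_per_second is hypothetical.

```python
from datetime import datetime

FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Apr 21 16:22:28 2015"


def mib_per_second(gib_resilvered, start, end):
    """Return (average MiB/s, elapsed seconds) for one resilver run."""
    seconds = (datetime.strptime(end, FORMAT)
               - datetime.strptime(start, FORMAT)).total_seconds()
    return gib_resilvered * 1024 / seconds, seconds


# Figures copied from the zpool status outputs of both days.
RUNS = {
    "ada1 (Apr 20)": (110.0, "Mon Apr 20 17:12:37 2015",
                      "Mon Apr 20 19:16:36 2015"),
    "ada0 (Apr 21)": (97.5, "Tue Apr 21 16:22:28 2015",
                      "Tue Apr 21 18:11:03 2015"),
}

for name, (gib, start, end) in RUNS.items():
    rate, seconds = mib_per_second(gib, start, end)
    print(f"{name}: {int(seconds)} s, {rate:.1f} MiB/s")
```

Both averages land close to the rates shown in the last progress lines, so despite the slow start the second resilver was no slower overall than the first.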
Here are the S.M.A.R.T. attributes of the new /dev/ada0 drive:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       82888
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       765119
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   051   045    Old_age   Always       -       35 (Min/Max 25/36)
194 Temperature_Celsius     0x0022   035   049   000    Old_age   Always       -       35 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       82888
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+05m+56.015s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       200088940
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       46180
It's time to run short, conveyance, and long S.M.A.R.T. tests on /dev/ada0:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -
Update 2015-04-22
I ran a scrub on the enterprise_zroot pool:
root@enterprise:~>zpool status -v enterprise_zroot
pool: enterprise_zroot
state: ONLINE
scan: scrub repaired 0 in 1h44m with 0 errors on Wed Apr 22 10:13:09 2015
config:
NAME STATE READ WRITE CKSUM
enterprise_zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
errors: No known data errors