/dev/ada1 failing
smartd sent me not one, but two emails today, regarding /dev/ada1.
From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:34 +0200 (CEST)
Subject: SMART error (Health) detected on host: enterprise

This message was generated by the smartd daemon running on:

   host name:  enterprise
   DNS domain: ximalas.info

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!

Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:37 +0200 (CEST)
Subject: SMART error (Usage) detected on host: enterprise

This message was generated by the smartd daemon running on:

   host name:  enterprise
   DNS domain: ximalas.info

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.

Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
smartctl has this to say about /dev/ada1 and its S.M.A.R.T. attributes:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   006    Pre-fail  Always       -       3136808
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   035   035   036    Pre-fail  Always   FAILING_NOW 21480
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       601807587
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   053   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   034   028   000    Old_age   Always       -       3136808
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+00m+17.596s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2621705511
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2832710658
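The FAILING_NOW flag on attribute 5 follows from the numbers in the table: smartmontools treats an attribute as failed when its normalised VALUE is less than or equal to THRESH, and here 035 is at or below 036. A minimal Python sketch of that check, using three rows copied from the table above; the function name `failing_attributes` and the assumed column order are mine and not part of smartctl:

```python
# Three rows copied from the smartctl output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   100   099   006    Pre-fail  Always       -       3136808
  5 Reallocated_Sector_Ct   0x0033   035   035   036    Pre-fail  Always   FAILING_NOW 21480
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
"""


def failing_attributes(table):
    """Return names of attributes whose normalised VALUE is <= THRESH.

    Assumes the column order ID, NAME, FLAG, VALUE, WORST, THRESH, ...
    as printed by `smartctl -A`.
    """
    failing = []
    for line in table.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append(name)
    return failing


print(failing_attributes(SAMPLE))
```

Only Reallocated_Sector_Ct is reported, which matches the attribute named in the second smartd email.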
The four CRC errors (attribute 199) suggest that something could be wrong with the SATA cable. The raw read error rate also looks suspicious: at 3136808, it is more than twice that of /dev/ada0 (1502664).
/dev/ada0 has been /dev/ada1's partner for the last 3 years or so, and here are the former's S.M.A.R.T. attributes:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   006    Pre-fail  Always       -       1502664
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       623196340
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   053   045    Old_age   Always       -       32 (Min/Max 31/37)
194 Temperature_Celsius     0x0022   032   047   000    Old_age   Always       -       32 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       1502664
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+59m+05.142s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       412698749
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3839349951
ZFS is still happy as ever:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h52m with 0 errors on Sun Apr 12 05:40:43 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors
/dev/ada1 is partitioned like this:
root@enterprise:~>gpart show -l ada1
=>       34  976773101  ada1  GPT  (465G)
         34          6        - free -  (3.0k)
         40        256     1  gptboot1  (128k)
        296       1752        - free -  (876k)
       2048   33554432     2  swap1  (16G)
   33556480  943216648     3  enterprise_zroot1  (449G)
  976773128          7        - free -  (3.5k)
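The sizes in parentheses can be reproduced from the sector counts in the second column. A small Python sketch, assuming 512-byte logical sectors (an assumption of mine, though it is consistent with the sizes gpart prints); the helper name `size_gib` is hypothetical:

```python
SECTOR_BYTES = 512  # assumption: logical sector size of the drive

# Sector counts copied from the gpart output above.
PARTITIONS = {
    "gptboot1": 256,
    "swap1": 33554432,
    "enterprise_zroot1": 943216648,
}


def size_gib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to GiB (2**30 bytes)."""
    return sectors * sector_bytes / 2**30


for label, sectors in PARTITIONS.items():
    print(f"{label}: {size_gib(sectors):.4f} GiB")
```

Under that assumption swap1 comes out at exactly 16 GiB and enterprise_zroot1 at roughly 449.76 GiB, which gpart displays as 449G.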
For completeness, here are the S.M.A.R.T. attributes for the three remaining hard drives:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       317864
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       684637847
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28478
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   056   045    Old_age   Always       -       30 (Min/Max 30/35)
194 Temperature_Celsius     0x0022   030   044   000    Old_age   Always       -       30 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   051   030   000    Old_age   Always       -       317864
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28477h+42m+14.509s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1111675742
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1742318989

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       56200
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       666707923
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   051   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   049   000    Old_age   Always       -       31 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   059   033   000    Old_age   Always       -       56200
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28474h+55m+46.515s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2559712939
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1733063492

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       511904
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       678977376
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28477
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   071   053   045    Old_age   Always       -       29 (Min/Max 28/33)
194 Temperature_Celsius     0x0022   029   047   000    Old_age   Always       -       29 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   055   030   000    Old_age   Always       -       511904
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28476h+57m+17.472s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4252139494
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       398472105
I promptly ordered six new ST500DM002 hard drives, and I expect them to arrive within the week. Hopefully, the system will be able to keep itself afloat until I can replace the wonky hard drive.
Update 2015-04-20
The failing /dev/ada1 drive was replaced today.
Before replacing the drive, I took a backup of the GPT using gpart backup ada1 > /root/gpart.ada1.txt.
After replacing the drive:
- I booted into single user mode,
- ran gpart restore -l ada1 < /root/gpart.ada1.txt,
- ran gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1,
- rebooted to multi user mode,
- ran zpool online enterprise_zroot ada1p3 as suggested by the zpool status command, and
- realised zpool replace enterprise_zroot ada1p3 is really the way to go.
I assume the zpool online command is responsible for the 744 checksum errors listed below.
ZFS began resilvering the ada1p3 partition:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        11,0G scanned out of 110G at 9,11M/s, 3h4m to go
        11,0G resilvered, 10,04% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors
Just sit back and enjoy the ride of your life … :P
Here are the last three outputs from the zpool status command:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        109G scanned out of 110G at 15,1M/s, 0h0m to go
        109G resilvered, 99,81% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        110G scanned out of 110G at 15,1M/s, (scan is slow, no estimated time)
        110G resilvered, 100,11% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0   744

errors: No known data errors
A subsequent zpool clear reset all counters:
root@enterprise:~>zpool clear enterprise_zroot
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors
Resilvering took about 02:03:59.
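That duration is simply the gap between the "resilver in progress since" timestamp and the "resilvered ... on" timestamp in the zpool status output. A short Python sketch of the arithmetic, which also derives the average throughput; the function name `resilver_stats` is hypothetical, and treating zpool's G and M as GiB and MiB is an assumption on my part:

```python
from datetime import datetime

# Timestamp layout used in the zpool status "scan:" lines.
FMT = "%a %b %d %H:%M:%S %Y"


def resilver_stats(start, end, gib):
    """Return (elapsed timedelta, average MiB/s) for a resilver of `gib` GiB."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    mib_per_s = gib * 1024 / elapsed.total_seconds()
    return elapsed, mib_per_s


elapsed, rate = resilver_stats(
    "Mon Apr 20 17:12:37 2015",  # resilver in progress since ...
    "Mon Apr 20 19:16:36 2015",  # resilvered 110G ... on ...
    110,
)
print(elapsed, f"{rate:.1f} MiB/s")
```

The average of about 15.1 MiB/s agrees with the 15,1M/s rate reported near the end of the resilver.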
The new /dev/ada1 drive logged one CRC error during the resilver, which points to a problem with the SATA cable or possibly the SATA connector on the motherboard:
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 25 bf 40 1e 00 00 01 00 00
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): Retrying command
Here are the S.M.A.R.T. attributes of the new /dev/ada1 drive:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       248
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       829317
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   054   045    Old_age   Always       -       33 (Min/Max 25/33)
194 Temperature_Celsius     0x0022   033   046   000    Old_age   Always       -       33 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       248
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+15m+19.425s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       225023812
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       62700
Next, I ran short, conveyance, and long S.M.A.R.T. tests on both /dev/ada0 and /dev/ada1. The results are shown below, /dev/ada0 first (judging by the lifetime hours).
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28650         -
# 2  Conveyance offline  Completed without error       00%     28649         -
# 3  Short offline       Completed without error       00%     28648         -
# 4  Extended offline    Completed without error       00%     20635         -
# 5  Short offline       Completed without error       00%     20633         -
# 6  Extended offline    Completed without error       00%      3721         -
# 7  Short offline       Completed without error       00%      3720         -
# 8  Extended offline    Completed without error       00%      3363         -
# 9  Short offline       Completed without error       00%      3361         -
#10  Extended offline    Completed without error       00%      3000         -
#11  Short offline       Completed without error       00%      2999         -
#12  Extended offline    Completed without error       00%      1451         -
#13  Extended offline    Completed without error       00%      1311         -
#14  Extended offline    Completed without error       00%       613         -
#15  Extended offline    Completed without error       00%       550         -
#16  Short offline       Completed without error       00%       548         -
#17  Extended offline    Completed without error       00%       259         -
#18  Conveyance offline  Completed without error       00%       134         -
#19  Extended offline    Completed without error       00%        14         -
#20  Short offline       Completed without error       00%        12         -

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         4         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -
Update 2015-04-21
I ran a scrub on the enterprise_zroot pool this morning; all OK:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors
In the afternoon I replaced the /dev/ada0 drive, and the system booted automatically from the /dev/ada1 drive. I selected single user mode from the boot loader's menu, restored the GPT partition table in essentially the same way as I did yesterday, and installed the necessary boot blocks. I then rebooted the system and let it boot all the way to multi user mode.
ZFS complained about the missing ada0p3 member, so I issued the zpool replace command, and ZFS responded by resilvering the new /dev/ada0 drive:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3
            ada1p3                ONLINE       0     0     0

errors: No known data errors
root@enterprise:~>zpool replace enterprise_zroot ada0p3
Make sure to wait until resilver is done before rebooting.

If you boot from pool 'enterprise_zroot', you may need to update
boot code on newly attached disk 'ada0p3'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        34,1M scanned out of 97,4G at 3,79M/s, 7h18m to go
        33,7M resilvered, 0,03% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
As you might have noticed, the amount of data to be resilvered today (97,4G) is less than yesterday afternoon's (110G). I wiped /usr/obj and /usr/obj-10 clean before replacing the /dev/ada0 drive, and thus gained some gigabytes of free space.
For some reason, today's resilvering is slower than yesterday's:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        4,53G scanned out of 97,4G at 2,49M/s, 10h36m to go
        4,53G resilvered, 4,65% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
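The "10h36m to go" estimate in the output above appears to be nothing more than the remaining data divided by the current rate. The Python sketch below reproduces it from the displayed figures; this is my reading of the numbers, not a description of ZFS internals, and since zpool prints rounded values the result can be off by a minute for other snapshots. The helper names `parse_size` and `time_to_go` are hypothetical, and G and M are assumed to mean GiB and MiB:

```python
def parse_size(text):
    """Convert a zpool size such as '4,53G' or '2,49M' to MiB.

    The output above uses a decimal comma, so it is swapped for a dot.
    """
    units = {"M": 1, "G": 1024}
    number, unit = text[:-1], text[-1]
    return float(number.replace(",", ".")) * units[unit]


def time_to_go(scanned, total, rate_per_s):
    """Estimate the remaining time as (total - scanned) / rate, as 'XhYm'."""
    remaining_s = (parse_size(total) - parse_size(scanned)) / parse_size(rate_per_s)
    hours, rest = divmod(int(remaining_s), 3600)
    return f"{hours}h{rest // 60}m"


# Figures from the scan: lines above; the '/s' suffix of the rate is dropped.
print(time_to_go("4,53G", "97,4G", "2,49M"))
```

With 92,87G left at 2,49M/s this comes to roughly 38,000 seconds, so a slow start inflates the estimate considerably.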
Maybe the /dev/ada1 SATA cable is wonky. Or maybe I spoke too soon. After I yelled a bit, the pace suddenly increased:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        16,1G scanned out of 97,4G at 7,16M/s, 3h13m to go
        16,1G resilvered, 16,52% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
Resilvering is now more than a third of the way through:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        35,7G scanned out of 97,4G at 10,3M/s, 1h42m to go
        35,7G resilvered, 36,68% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
Aaaand, we're halfway through the resilvering process:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        49,4G scanned out of 97,4G at 11,2M/s, 1h12m to go
        49,4G resilvered, 50,74% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
Less than one quarter remains:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        73,5G scanned out of 97,4G at 13,1M/s, 0h31m to go
        73,5G resilvered, 75,47% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
Here are the last three outputs from the zpool status command:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        97,3G scanned out of 97,4G at 15,3M/s, 0h0m to go
        97,2G resilvered, 99,88% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        97,4G scanned out of 97,4G at 15,3M/s, (scan is slow, no estimated time)
        97,4G resilvered, 100,02% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: resilvered 97,5G in 1h48m with 0 errors on Tue Apr 21 18:11:03 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors
This time the resilvering lasted for 01:48:35 (from 16:22:28 to 18:11:03). All error counters are zero, so there's no reason to run the zpool clear command.
Here are the S.M.A.R.T. attributes of the new /dev/ada0 drive:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       82888
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       765119
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   051   045    Old_age   Always       -       35 (Min/Max 25/36)
194 Temperature_Celsius     0x0022   035   049   000    Old_age   Always       -       35 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       82888
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+05m+56.015s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       200088940
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       46180
It's time to run short, conveyance, and long S.M.A.R.T. tests on /dev/ada0:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -
Update 2015-04-22
I ran a scrub on the enterprise_zroot pool:
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h44m with 0 errors on Wed Apr 22 10:13:09 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors