/dev/ada1 failing

smartd sent me not one, but two emails today, regarding /dev/ada1.

From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:34 +0200 (CEST)
Subject: SMART error (Health) detected on host: enterprise

This message was generated by the smartd daemon running on:

   host name:  enterprise
   DNS domain: ximalas.info

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!

Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

From: Superuser <root@enterprise.ximalas.info>
To: hostmaster@ximalas.info
Date: Mon, 13 Apr 2015 12:33:37 +0200 (CEST)
Subject: SMART error (Usage) detected on host: enterprise

This message was generated by the smartd daemon running on:

   host name:  enterprise
   DNS domain: ximalas.info

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.

Device info:
ST500DM002-1BD142, S/N:[withheld], WWN:5-000c50-03f5711f2, FW:KC45, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

smartctl has this to say about /dev/ada1 and its S.M.A.R.T. attributes:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   006    Pre-fail  Always       -       3136808
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   035   035   036    Pre-fail  Always   FAILING_NOW 21480
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       601807587
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       0 0 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   053   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   047   000    Old_age   Always       -       31 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   034   028   000    Old_age   Always       -       3136808
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+00m+17.596s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2621705511
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2832710658

The 4 CRC errors suggest that something could be wrong with the SATA cable. The raw read error rate also looks suspicious: it is more than twice that of /dev/ada0.
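smartctl flags attribute 5 as FAILING_NOW because its normalized VALUE (035) has dropped to or below THRESH (036). The sketch below repeats that comparison on a pasted attribute table. The function name failing_attributes is made up for this example, and the sample rows are copied from the /dev/ada1 table above.

```shell
#!/bin/sh
# Print every attribute whose normalized VALUE is at or below THRESH.
# Input: the attribute table as printed by `smartctl -A`.
# Columns: 1=ID 2=NAME 3=FLAG 4=VALUE 5=WORST 6=THRESH ... 10=RAW_VALUE
# Assumption: a THRESH of 000 means "no threshold", so such rows are skipped.
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        printf "%s %s value=%s thresh=%s raw=%s\n", $1, $2, $4, $6, $10
    }'
}

# Three rows copied from the /dev/ada1 table.
failing_attributes <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   100   099   006    Pre-fail  Always       -       3136808
  5 Reallocated_Sector_Ct   0x0033   035   035   036    Pre-fail  Always   FAILING_NOW 21480
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
EOF
```

On a live system the same function could presumably be fed directly, as in smartctl -A /dev/ada1 | failing_attributes.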

/dev/ada0 has been /dev/ada1’s partner for the last 3 years or so, and here are the former’s S.M.A.R.T. attributes:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   006    Pre-fail  Always       -       1502664
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       623196340
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   053   045    Old_age   Always       -       32 (Min/Max 31/37)
194 Temperature_Celsius     0x0022   032   047   000    Old_age   Always       -       32 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       1502664
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28475h+59m+05.142s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       412698749
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3839349951

ZFS is still happy as ever:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h52m with 0 errors on Sun Apr 12 05:40:43 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors

/dev/ada1 is partitioned like this:

root@enterprise:~>gpart show -l ada1
=>       34  976773101  ada1  GPT  (465G)
         34          6        - free -  (3.0k)
         40        256     1  gptboot1  (128k)
        296       1752        - free -  (876k)
       2048   33554432     2  swap1  (16G)
   33556480  943216648     3  enterprise_zroot1  (449G)
  976773128          7        - free -  (3.5k)

For completeness, here are the S.M.A.R.T. attributes for the three remaining hard drives:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       317864
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       684637847
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28478
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   056   045    Old_age   Always       -       30 (Min/Max 30/35)
194 Temperature_Celsius     0x0022   030   044   000    Old_age   Always       -       30 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   051   030   000    Old_age   Always       -       317864
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28477h+42m+14.509s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1111675742
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1742318989

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       56200
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       666707923
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28475
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   051   045    Old_age   Always       -       31 (Min/Max 30/36)
194 Temperature_Celsius     0x0022   031   049   000    Old_age   Always       -       31 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   059   033   000    Old_age   Always       -       56200
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28474h+55m+46.515s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2559712939
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1733063492

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       511904
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       678977376
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       28477
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   071   053   045    Old_age   Always       -       29 (Min/Max 28/33)
194 Temperature_Celsius     0x0022   029   047   000    Old_age   Always       -       29 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   055   030   000    Old_age   Always       -       511904
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       28476h+57m+17.472s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4252139494
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       398472105

I promptly ordered 6 new ST500DM002 hard drives, and I expect them to arrive within the week. Hopefully, the system will be able to keep itself afloat until I can replace the wonky hard drive.


Update 2015-04-20
The failing /dev/ada1 drive was replaced today.

Before replacing the drive, I took a backup of the GPT using gpart backup ada1 > /root/gpart.ada1.txt.

After replacing the drive:

  1. I booted into single user mode,
  2. ran gpart restore -l ada1 < /root/gpart.ada1.txt,
  3. ran gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1,
  4. rebooted to multi user mode,
  5. ran zpool online enterprise_zroot ada1p3 as suggested by the zpool status command, and
  6. realised zpool replace enterprise_zroot ada1p3 is really the way to go.

I assume the zpool online command is responsible for the 744 checksum errors listed below.
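With hindsight, the procedure condenses to the sequence sketched below, which goes straight to zpool replace and skips the zpool online detour. This is only a dry run: the made-up function replacement_plan prints the commands instead of running them, and it assumes the same layout as on this machine (boot partition at index 1, backup file under /root).

```shell
#!/bin/sh
# Print, without executing, the commands for swapping one mirror member.
# Arguments: pool name, disk (e.g. ada1), index of the ZFS partition.
# Assumption: the boot partition has index 1, as on this machine.
replacement_plan() {
    pool=$1 disk=$2 part=$3
    backup="/root/gpart.$disk.txt"
    cat <<EOF
# Before pulling the old drive:
gpart backup $disk > $backup
# After fitting the new drive (single user mode):
gpart restore -l $disk < $backup
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $disk
# Back in multi user mode:
zpool replace $pool ${disk}p$part
zpool status -v $pool
EOF
}

replacement_plan enterprise_zroot ada1 3
```

Printing the plan first makes it easy to double-check the disk name before anything touches a partition table.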

ZFS began resilvering the ada1p3 partition:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        11,0G scanned out of 110G at 9,11M/s, 3h4m to go
        11,0G resilvered, 10,04% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors

Just sit back and enjoy the ride of your life … :P

Here are the last three outputs from the zpool status command:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        109G scanned out of 110G at 15,1M/s, 0h0m to go
        109G resilvered, 99,81% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 20 17:12:37 2015
        110G scanned out of 110G at 15,1M/s, (scan is slow, no estimated time)
        110G resilvered, 100,11% done
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            ada0p3                ONLINE       0     0     0
            replacing-1           UNAVAIL      0     0     0
              776201329010632765  UNAVAIL      0     0     0  was /dev/ada1p3/old
              ada1p3              ONLINE       0     0   744  (resilvering)

errors: No known data errors

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0   744

errors: No known data errors

A subsequent zpool clear reset all counters:

root@enterprise:~>zpool clear enterprise_zroot
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: resilvered 110G in 2h3m with 0 errors on Mon Apr 20 19:16:36 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors

Resilvering took about 02:03:59.
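The figure comes from subtracting the resilver start time (17:12:37) from the completion time (19:16:36). A small sketch of that arithmetic follows, along with the average rate it implies. The function names are made up for this example, and the rate assumes that zpool's G and M are binary units (GiB and MiB) and that 110G is close enough despite being rounded.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Elapsed time between two timestamps on the same day, as HH:MM:SS.
elapsed() {
    start=$(to_seconds "$1")
    end=$(to_seconds "$2")
    diff=$((end - start))
    printf '%02d:%02d:%02d\n' $((diff / 3600)) $((diff % 3600 / 60)) $((diff % 60))
}

# Average rate in MiB/s, given an amount in GiB and a duration in seconds.
average_rate() {
    LC_ALL=C awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / secs }'
}

elapsed 17:12:37 19:16:36    # prints 02:03:59
average_rate 110 7439        # prints 15.1
```

The 15.1 MiB/s average agrees with the 15,1M/s that zpool status reported near the end of the resilver.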

The /dev/ada1 drive developed one CRC error, indicating a possible problem with the SATA cable or possibly the SATA connector on the motherboard:

Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 25 bf 40 1e 00 00 01 00 00
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): Retrying command
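To keep an eye on whether such errors recur, the kernel log can be tallied per device. The sketch below counts lines carrying the CAM status message shown above. The function name crc_errors_per_device is made up for this example, and the sample input is the three log lines from this post.

```shell
#!/bin/sh
# Count "Uncorrectable parity/CRC error" CAM status lines per device.
crc_errors_per_device() {
    awk '/CAM status: Uncorrectable parity\/CRC error/ {
        # The device name is the text between "(" and the first ":".
        if (match($0, /\([a-z]+[0-9]+:/)) {
            dev = substr($0, RSTART + 1, RLENGTH - 2)
            count[dev]++
        }
    }
    END { for (dev in count) print dev, count[dev] }'
}

crc_errors_per_device <<'EOF'
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 a0 25 bf 40 1e 00 00 01 00 00
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Apr 20 17:17:10 <kern.crit> enterprise kernel: [450] (ada1:ahcich1:0:0:0): Retrying command
EOF
```

For the sample this prints "ada1 1". On a real system the function would read the log file instead, for example /var/log/messages if that is where kernel messages end up on the machine in question.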

Here are the S.M.A.R.T. attributes of the new /dev/ada1 drive:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       248
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       829317
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   054   045    Old_age   Always       -       33 (Min/Max 25/33)
194 Temperature_Celsius     0x0022   033   046   000    Old_age   Always       -       33 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       248
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+15m+19.425s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       225023812
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       62700

Next, I ran short, conveyance, and long S.M.A.R.T. tests on both /dev/ada0 and /dev/ada1. The results are shown below.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28650         -
# 2  Conveyance offline  Completed without error       00%     28649         -
# 3  Short offline       Completed without error       00%     28648         -
# 4  Extended offline    Completed without error       00%     20635         -
# 5  Short offline       Completed without error       00%     20633         -
# 6  Extended offline    Completed without error       00%      3721         -
# 7  Short offline       Completed without error       00%      3720         -
# 8  Extended offline    Completed without error       00%      3363         -
# 9  Short offline       Completed without error       00%      3361         -
#10  Extended offline    Completed without error       00%      3000         -
#11  Short offline       Completed without error       00%      2999         -
#12  Extended offline    Completed without error       00%      1451         -
#13  Extended offline    Completed without error       00%      1311         -
#14  Extended offline    Completed without error       00%       613         -
#15  Extended offline    Completed without error       00%       550         -
#16  Short offline       Completed without error       00%       548         -
#17  Extended offline    Completed without error       00%       259         -
#18  Conveyance offline  Completed without error       00%       134         -
#19  Extended offline    Completed without error       00%        14         -
#20  Short offline       Completed without error       00%        12         -

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         4         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -

Update 2015-04-21
I ran a scrub on the enterprise_zroot pool this morning, and all is OK:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors

In the afternoon I replaced the /dev/ada0 drive, and the system booted automatically from the /dev/ada1 drive. I selected single user mode from the boot loader's menu, restored the GPT partition table in essentially the same way as I did yesterday, and installed the necessary boot blocks. I then rebooted the system and let it boot to multi user mode.

ZFS complained about the missing ada0p3 member, so I issued the zpool replace command, and ZFS responded by resilvering the new /dev/ada0 drive:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 1h47m with 0 errors on Tue Apr 21 10:07:31 2015
config:

        NAME                      STATE     READ WRITE CKSUM
        enterprise_zroot          DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3
            ada1p3                ONLINE       0     0     0

errors: No known data errors

root@enterprise:~>zpool replace enterprise_zroot ada0p3
Make sure to wait until resilver is done before rebooting.

If you boot from pool 'enterprise_zroot', you may need to update
boot code on newly attached disk 'ada0p3'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        34,1M scanned out of 97,4G at 3,79M/s, 7h18m to go
        33,7M resilvered, 0,03% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors

As you might have noticed, there is less data to resilver today than there was yesterday afternoon. I wiped /usr/obj and /usr/obj-10 clean before replacing the /dev/ada0 drive, and thus gained some gigabytes of free space.

For some reason, today's resilvering is slower than yesterday's:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        4,53G scanned out of 97,4G at 2,49M/s, 10h36m to go
        4,53G resilvered, 4,65% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
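The "to go" estimate in these status lines is consistent with simply dividing the data still to be scanned by the current rate. The sketch below reproduces the estimate from the output above. The function name eta is made up for this example, the values must be entered with a decimal point rather than the comma this locale prints, and G and M are assumed to be binary units.

```shell
#!/bin/sh
# Estimate the remaining resilver time as (total - scanned) / rate.
# Arguments: total in GiB, scanned in GiB, rate in MiB/s.
eta() {
    LC_ALL=C awk -v total="$1" -v scanned="$2" -v rate="$3" 'BEGIN {
        secs = int((total - scanned) * 1024 / rate)
        printf "%dh%dm\n", int(secs / 3600), int(secs % 3600 / 60)
    }'
}

eta 97.4 4.53 2.49    # prints 10h36m
```

Because the rate is an average since the start of the resilver, the estimate shrinks quickly once the pace picks up.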

Maybe the /dev/ada1 SATA cable is wonky. Or maybe I spoke too soon. After I yelled a bit, the pace suddenly increased:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        16,1G scanned out of 97,4G at 7,16M/s, 3h13m to go
        16,1G resilvered, 16,52% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors

Resilvering is now more than a third of the way through:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        35,7G scanned out of 97,4G at 10,3M/s, 1h42m to go
        35,7G resilvered, 36,68% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors

Aaaand, we're halfway through the resilvering process:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        49,4G scanned out of 97,4G at 11,2M/s, 1h12m to go
        49,4G resilvered, 50,74% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors

Less than one quarter remains:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        73,5G scanned out of 97,4G at 13,1M/s, 0h31m to go
        73,5G resilvered, 75,47% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
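
The "to go" figure can be roughly reproduced from the other numbers on the scan line: remaining data divided by the current rate. The sketch below is only an approximation of whatever ZFS does internally. It assumes G and M are binary units (a factor of 1024 between them), and the function name resilver_eta is made up for this example. The arguments use a dot as the decimal separator, whereas the output above shows a comma because of the locale.

```shell
#!/bin/sh
# Estimate the remaining resilver time from the figures on the scan line.
# Arguments: scanned (G), total (G), rate (M/s), dot as decimal separator.
# Assumption: 1 G = 1024 M, i.e. binary units.
resilver_eta() {
    LC_ALL=C awk -v scanned="$1" -v total="$2" -v rate="$3" 'BEGIN {
        secs = (total - scanned) * 1024 / rate
        printf "%dh%02dm\n", int(secs / 3600), int(secs % 3600 / 60)
    }'
}

# Figures from the output above: 73,5G of 97,4G at 13,1M/s.
resilver_eta 73.5 97.4 13.1   # prints 0h31m
```

Since the displayed figures are rounded, the result can be off by a minute or so compared with what zpool prints, and the estimate changes as the rate changes.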

Here are the last three outputs from the zpool status command:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        97,3G scanned out of 97,4G at 15,3M/s, 0h0m to go
        97,2G resilvered, 99,88% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Apr 21 16:22:28 2015
        97,4G scanned out of 97,4G at 15,3M/s, (scan is slow, no estimated time)
        97,4G resilvered, 100,02% done
config:

        NAME                        STATE     READ WRITE CKSUM
        enterprise_zroot            DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            replacing-0             UNAVAIL      0     0     0
              14125420615616302625  UNAVAIL      0     0     0  was /dev/ada0p3/old
              ada0p3                ONLINE       0     0     0  (resilvering)
            ada1p3                  ONLINE       0     0     0

errors: No known data errors
root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: resilvered 97,5G in 1h48m with 0 errors on Tue Apr 21 18:11:03 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors

This time the resilvering lasted for 01:48:35. Since the pool reports no errors, there's no reason to run the zpool clear command.
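
The duration can be cross-checked against the 1h48m that zpool reports by subtracting the start timestamp (16:22:28) from the completion timestamp (18:11:03). Here is a minimal POSIX sh sketch of that arithmetic. The function names to_seconds and elapsed are made up for this example, and it assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero per field so that "08" is not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the time between two same-day timestamps as HH:MM:SS.
elapsed() {
    d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    printf '%02d:%02d:%02d\n' $(( d / 3600 )) $(( d % 3600 / 60 )) $(( d % 60 ))
}

# Start and completion times taken from the zpool status output above.
elapsed 16:22:28 18:11:03   # prints 01:48:35
```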

Here are the S.M.A.R.T. attributes of the new /dev/ada0 drive:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       82888
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       765119
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   051   045    Old_age   Always       -       35 (Min/Max 25/36)
194 Temperature_Celsius     0x0022   035   049   000    Old_age   Always       -       35 (0 25 0 0 0)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       82888
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2h+05m+56.015s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       200088940
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       46180

It's time to run short, conveyance, and long S.M.A.R.T. tests on /dev/ada0:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
# 2  Conveyance offline  Completed without error       00%         2         -
# 3  Short offline       Completed without error       00%         2         -
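
The exact commands used to start these tests are not shown above. With smartmontools they would typically look like the sketch below. The run wrapper and the DRY_RUN and DISK variables are assumptions added for this example: by default the commands are only printed, so the script is safe to try without root access or a real disk.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (needs root and smartmontools).
DISK=${DISK:-/dev/ada0}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Start one self-test; $1 is short, conveyance, or long.
start_selftest() {
    run smartctl -t "$1" "$DISK"
}

# Show the self-test log, like the table above.
show_selftest_log() {
    run smartctl -l selftest "$DISK"
}

start_selftest short
show_selftest_log
```

The tests run inside the drive itself, so each one should be allowed to finish before the next is started. The log shows the most recent test first.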

Update 2015-04-22
I ran a scrub on the enterprise_zroot pool:

root@enterprise:~>zpool status -v enterprise_zroot
  pool: enterprise_zroot
 state: ONLINE
  scan: scrub repaired 0 in 1h44m with 0 errors on Wed Apr 22 10:13:09 2015
config:

        NAME              STATE     READ WRITE CKSUM
        enterprise_zroot  ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            ada0p3        ONLINE       0     0     0
            ada1p3        ONLINE       0     0     0

errors: No known data errors
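
For reference, a scrub like this one is usually started and checked as sketched below. The run wrapper and the DRY_RUN and POOL variables are assumptions added for this example: by default the commands are only printed rather than executed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (needs root and a pool by this name).
POOL=${POOL:-enterprise_zroot}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Kick off a scrub; the command returns at once, the scrub runs in the background.
scrub_pool() {
    run zpool scrub "$POOL"
}

# Check progress, or the final result once the scrub is done.
pool_status() {
    run zpool status -v "$POOL"
}

scrub_pool
pool_status
```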