In brief: I want to know whether an HDD with a failed SMART test can be repaired by any means, and if so, whether it is still reliable enough.
In detail: I have a 4-year-old 1TB Western Digital HDD (WD10JPVX-08JC3T6) with no previous problems.
Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: C700A041-8C28-42E8-9ADB-24A5F86B961A
Device Start End Sectors Size Type
/dev/sda1 2048 1050623 1048576 512M EFI System
/dev/sda2 1050624 1936945151 1935894528 923.1G Linux filesystem
/dev/sda3 1936945152 1953523711 16578560 7.9G Linux swap
All of a sudden (worth mentioning: probably in high-humidity conditions) I found that my Debian root partition sda2 had become read-only. I ran a smartctl long test, which completed with a read failure; Current_Pending_Sector was 109 and Reallocated_Event_Count was 0.
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10JPVX-08JC3T6
Serial Number: WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Sep 23 14:49:11 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (18000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 677
3 Spin_Up_Time 0x0027 189 179 021 Pre-fail Always - 1533
4 Start_Stop_Count 0x0032 091 091 000 Old_age Always - 9943
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 13700
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1834
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 119
193 Load_Cycle_Count 0x0032 139 139 000 Old_age Always - 185919
194 Temperature_Celsius 0x0022 113 089 000 Old_age Always - 34
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 109
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1
240 Head_Flying_Hours 0x0032 084 084 000 Old_age Always - 12291
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
1 Extended offline Completed: read failure 40% 13684 1385167592
2 Short offline Completed without error 00% 13682 -
3 Extended offline Completed: read failure 30% 13678 1385167592
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So I unmounted sda2, took a backup and ran e2fsck -fccky on it (which runs badblocks internally). It complained about I/O errors in a contiguous group of blocks, while also repairing the filesystem.
Then, hoping this had helped, I ran another smartctl long test, only to find that Current_Pending_Sector had increased to 786 and that the LBA_of_first_error was much smaller now.
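To relate the LBA_of_first_error from the self-test log to the blocks that badblocks complains about, the LBA can be converted into a partition and a filesystem block number. The following is only a rough sketch: the partition boundaries are copied from the fdisk output above, while the 4096-byte filesystem block size is an assumption (the real value is shown as "Block size" by tune2fs -l) and the function name `locate` is made up for illustration.

```python
SECTOR = 512      # logical sector size, from the fdisk output
FS_BLOCK = 4096   # ASSUMPTION: filesystem block size; verify with tune2fs -l

# (first sector, last sector) of each partition, from the fdisk output
PARTITIONS = {
    "/dev/sda1": (2048, 1050623),
    "/dev/sda2": (1050624, 1936945151),
    "/dev/sda3": (1936945152, 1953523711),
}


def locate(lba):
    """Return (partition, filesystem block) containing the given disk LBA."""
    for name, (start, end) in PARTITIONS.items():
        if start <= lba <= end:
            offset_sectors = lba - start
            return name, (offset_sectors * SECTOR) // FS_BLOCK
    return None, None


# LBA_of_first_error reported by the first failed extended self-test
print(locate(1385167592))   # ('/dev/sda2', 173014621)
```

Under that block-size assumption, the first failing LBA lies inside sda2, which matches the partition that went read-only.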
Following the many people who consider a drive with a failed SMART test to be a dead drive (like many answers here), I was ready to let my HDD go, until I found a shop (with no affiliation to WD) that claimed they could 'repair' my HDD with a tool called PC-3000. They did their job and said that the drive was healthy now, but I couldn't confirm it: I ran another smartctl long test and it ended with a read failure again, but all my previous SMART reports were gone, and this time both Current_Pending_Sector and Reallocated_Event_Count were 0. I also ran badblocks on the drive again, only to find the same I/O errors. I even dded the reported blocks to confirm that they couldn't be read. Their technician ignorantly insisted that I should just install Windows on the drive to see if it works. Certain that the Windows installer wouldn't even be able to make an NTFS filesystem there, I made a small 2M partition around the error spot (which was about 744 blocks of 512B) and ran a complete mkfs.ntfs (with zeroing) on it. To my surprise, the filesystem was created successfully. I mounted it and was able to read and write the whole partition. Once again I dded those bad blocks, and this time read them successfully. Finally, I ran another smartctl long test, which also passed (although still with a high Raw_Read_Error_Rate).
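The dd read check described above can also be scripted. This is a minimal sketch, not the exact procedure I used: it reads a range of 512-byte sectors in 4096-byte steps (one physical sector at a time) and records where a read fails. The function name `unreadable_chunks` is hypothetical, and like plain dd without direct I/O the reads go through the page cache, so a recently read block may be served from memory instead of the disk.

```python
import os

SECTOR = 512         # logical sector size reported by the drive
CHUNK_SECTORS = 8    # 8 * 512 B = one 4096-byte physical sector


def unreadable_chunks(path, first_lba, sector_count):
    """Return the starting LBA of every chunk in the range that cannot be read."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = first_lba + sector_count
        lba = first_lba
        while lba < end:
            nbytes = min(CHUNK_SECTORS, end - lba) * SECTOR
            try:
                data = os.pread(fd, nbytes, lba * SECTOR)
                if len(data) != nbytes:   # short read: past the end of the device
                    bad.append(lba)
            except OSError:               # e.g. EIO returned for a failing sector
                bad.append(lba)
            lba += CHUNK_SECTORS
    finally:
        os.close(fd)
    return bad
```

Run as root against the whole disk, a call such as `unreadable_chunks("/dev/sda", 1345188144, 744)` would cover the error spot, assuming it starts at the LBA_of_first_error reported by the last failed self-test.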
Here are the results of tests #2 and #1, which were run before and after the mkfs.ntfs, respectively.
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-29-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10JPVX-08JC3T6
Serial Number: WD-WX31A27AYEN4
LU WWN Device Id: 5 0014ee 65cbcd3de
Firmware Version: 08.01A08
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Oct 8 05:54:51 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (18000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 410
3 Spin_Up_Time 0x0027 189 184 021 Pre-fail Always - 1550
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 34
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 44
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 33
194 Temperature_Celsius 0x0022 108 095 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 2
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
240 Head_Flying_Hours 0x0032 100 100 000 Old_age Always - 34
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
1 Extended offline Completed without error 00% 44 -
2 Extended offline Completed: read failure 40% 29 1345188144
3 Short offline Completed without error 00% 27 -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Then I also ran another badblocks on the whole drive, which found no errors.
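The two attribute tables above are easier to compare mechanically. Below is a small sketch that parses a few attribute rows (copied verbatim from the two reports) and lists the raw values that changed. The helper names are made up, and the parser assumes the raw value is a plain integer in the tenth column, which holds for the rows shown here but not necessarily for every drive.

```python
def parse_attributes(text):
    """Map attribute name -> integer raw value from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def raw_value_changes(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {name: (before[name], after[name])
            for name in before
            if name in after and before[name] != after[name]}


BEFORE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 13700
193 Load_Cycle_Count 0x0032 139 139 000 Old_age Always - 185919
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 109
"""

AFTER = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
  9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 44
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 33
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
"""

for name, (old, new) in raw_value_changes(parse_attributes(BEFORE),
                                          parse_attributes(AFTER)).items():
    print(f"{name}: {old} -> {new}")
```

Power_On_Hours dropping from 13700 to 44 and Load_Cycle_Count from 185919 to 33 is consistent with my observation that the previous SMART data was gone after the PC-3000 session, rather than the counters continuing from their old values.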
Update: I just restored my backup image (which means writing to practically every available block on the disk) and ran another successful smartctl -t long.
Now to summarize the above:
- The drive had a failed SMART test, and blocks with I/O errors, which were apparently increasing,
- something unknown to me was done to the drive, using this PC-3000,
- the drive was in the same state: the SMART test still failed, but its previous data was gone,
- I mkfs.ntfsed the error spot, and
- suddenly the errors were gone and the SMART test passed successfully.
- Notice that I didn't explicitly write on the error spot, although I guess badblocks does this anyway.
My questions:
Is there any explanation for what exactly happened? Was it really damaged and was it really fixed? How then? My simple guesses:
A. I just misinterpreted the SMART test as failing. It was good from the beginning.
B. The PC-3000 had done its job, but the drive was just waiting for a write on the error spot to do whatever it did to repair the blocks (like remapping them).
- I don't think mkfs.ntfs did anything except just writing zeroes (or probably its filesystem stuff) on the error spot, right?
Is my drive reliable now? Can I use it with no more concerns? And if so, does that mean a drive with a failed SMART test can be repaired?
What might this PC-3000 possibly do? Is it really a 'hardware fix' for a physically damaged drive?