10

I have an external SSD that suffered some file corruption earlier this week. Model is

Model Family:     Crucial/Micron RealSSD m4/C400/P400
Device Model:     M4-CT256M4SSD2

with, apparently, nearly 20,000 power-on hours on the clock.

Even though the status is:

SMART overall-health self-assessment test result: PASSED

the self-testing is failing:

enter image description here

gsmartcontrol reports the attributes as:

enter image description here

Full output is:

smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19045] (sf-7.2-1)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron RealSSD m4/C400/P400
Device Model:     M4-CT256M4SSD2
Serial Number:    0000000012050904896A
LU WWN Device Id: 5 00a075 10904896a
Firmware Version: 0309
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 05 11:36:29 2023 PM
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 ( 1190) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  19) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   050    -    0
  5 Reallocated_Sector_Ct   PO--CK   099   099   010    -    36864 (0 5)
  9 Power_On_Hours          -O--CK   100   100   001    -    19434
 12 Power_Cycle_Count       -O--CK   100   100   001    -    626
170 Grown_Failing_Block_Ct  PO--CK   099   099   010    -    89
171 Program_Fail_Count      -O--CK   100   100   001    -    20
172 Erase_Fail_Count        -O--CK   100   100   001    -    64
173 Wear_Leveling_Count     PO--CK   083   083   010    -    524
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   001    -    5
181 Non4k_Aligned_Access    -O---K   100   100   001    -    9248 4153 5094
183 SATA_Iface_Downshift    -O--CK   100   100   001    -    0
184 End-to-End_Error        PO--CK   100   100   050    -    0
187 Reported_Uncorrect      -O--CK   100   100   001    -    582
188 Command_Timeout         -O--CK   100   100   001    -    0
189 Factory_Bad_Block_Ct    -OSR--   100   100   001    -    85
194 Temperature_Celsius     -O---K   100   100   000    -    0
195 Hardware_ECC_Recovered  -O-RCK   100   100   001    -    353
196 Reallocated_Event_Count -O--CK   100   100   001    -    89
197 Current_Pending_Sector  -O--CK   100   100   001    -    0
198 Offline_Uncorrectable   ----CK   100   100   001    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   001    -    3
202 Perc_Rated_Life_Used    ---RC-   083   083   001    -    17
206 Write_Error_Rate        -OSR--   100   100   001    -    20
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O  16383  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O   3449  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0       GPL     VS    2000  Device vendor specific log
0xa0           SL  VS     208  Device vendor specific log
0xa1-0xbf  GPL,SL  VS       1  Device vendor specific log
0xc0       GPL     VS      80  Device vendor specific log
0xc1-0xdf  GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (16383 sectors)
No Errors Logged

SMART Extended Self-test Log size 3449 not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Extended offline    Completed: read failure        50%            19433  244627776
  2  Short offline       Completed: read failure        60%            19433  492159152
  3  Short offline       Completed: read failure        60%            16715  492159152
  4  Vendor (0xff)       Completed without error        00%            16602  -
  5  Vendor (0xff)       Completed without error        00%             5107  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       1 (0x0001)
Device State:                        Active (0)
Current Temperature:                 0 Celsius
Power Cycle Min/Max Temperature:     --/ 0 Celsius
Lifetime    Min/Max Temperature:     --/ 0 Celsius

SCT Temperature History Version:     2
Temperature Sampling Period:         10 minutes
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:           -5/75 Celsius
Temperature History Size (Index):    478 (151)

Index    Estimated Time   Temperature Celsius
 152    2023-04-02 04:00     ?  -
 ...    ..(473 skipped).    ..  -
 148    2023-04-05 11:00     ?  -
 149    2023-04-05 11:10     0  -
 150    2023-04-05 11:20     0  -
 151    2023-04-05 11:30     0  -

SMART WRITE LOG does not return COUNT and LBA_LOW register
SCT (Get) Error Recovery Control command failed

Device Statistics (GP Log 0x04)
Page  Offset Size         Value Flags Description
0x01  =====  =                =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              626  ---  Lifetime Power-On Resets
0x01  0x010  4            19434  ---  Power-on Hours
0x01  0x018  6      66167492621  ---  Logical Sectors Written
0x01  0x020  6       1499672681  ---  Number of Write Commands
0x01  0x028  6     138123876618  ---  Logical Sectors Read
0x01  0x030  6       2013843720  ---  Number of Read Commands
0x04  =====  =                =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4              582  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4                0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =                =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1                0  ---  Current Temperature
0x05  0x010  1                0  ---  Average Short Term Temperature
0x05  0x018  1                0  ---  Average Long Term Temperature
0x05  0x020  1                0  ---  Highest Temperature
0x05  0x028  1                0  ---  Lowest Temperature
0x05  0x030  1                0  ---  Highest Average Short Term Temperature
0x05  0x038  1                0  ---  Lowest Average Short Term Temperature
0x05  0x040  1                0  ---  Highest Average Long Term Temperature
0x05  0x048  1                0  ---  Lowest Average Long Term Temperature
0x05  0x050  4                -  ---  Time in Over-Temperature
0x05  0x058  1               70  ---  Specified Maximum Operating Temperature
0x05  0x060  4                -  ---  Time in Under-Temperature
0x05  0x068  1                0  ---  Specified Minimum Operating Temperature
0x06  =====  =                =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            13903  ---  Number of Hardware Resets
0x06  0x010  4                0  ---  Number of ASR Events
0x06  0x018  4                3  ---  Number of Interface CRC Errors
0x07  =====  =                =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1                4  N--  Percentage Used Endurance Indicator
                                 |||_ C monitored condition met
                                 ||__ D supports DSN
                                 |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x000a  4            0  Device-to-host register FISes sent due to a COMRESET

Crucial's own SMART report:

enter image description here enter image description here

I'm not too sure how to interpret the gsmartcontrol output, but I'm not convinced that the SMART PASSED result is correct. Time to bin and replace this drive?

Ian
  • 239
  • 5
  • 11

3 Answers

12

Forget looking at traffic-light-type self-tests. You have been given a wealth of information (SMART figures) that you just need to evaluate. Manufacturers have no interest in showing a negative check result anyway. Maybe one should replace all the instruments in an Airbus cockpit with two giant lights, coloured red and green, representing "passed" and "failed"? :)

Contrary to what others say, ignore the normalized value, because there is no defined norm for the normalization process. The same raw input therefore gives different normalized output from one manufacturer to another.

On an HDD, anything above 0 as a raw value for the reallocated sector count is a failure indicator; some other user wrote this here on Superuser, referring to either Google or Backblaze. Unlike on an HDD, sector (or rather flash block) reallocation is part of normal operation on an SSD. In your case, 36864 is a huge number in the HDD world; for an SSD it might not matter. I would rather look at the wear indicator, ID 202. Do not expect linear growth, as this indicator might take into account the number of spare flash blocks and other calculations.

Destructive partitioning of your SSD

Because I lack experience with SMART figures from SSDs (as opposed to HDDs), I can't answer your question, but in case you want to keep this SSD, please have a look at ID 181. This SMART attribute suggests that you have partitioned your SSD the wrong way, causing write amplification and adding wear to your SSD.

https://en.wikipedia.org/wiki/Write_amplification

Most likely you used a legacy operating system, for instance 32-bit Windows XP, when you partitioned the SSD. 32-bit XP tries to place partition starts on cylinder boundaries instead of multiples of a megabyte (2^20 bytes). That conflicts with the need to place partition starts on whole multiples of the internal physical sector size. In your case the starting byte offset of your partition(s) should be divisible by 4096. (Necessary condition: byte offset MOD 4096 = 0, which with 512-byte logical sectors means starting LBA MOD 8 = 0.) That is not the case now.

Copy the content of your SSD to a safe location, delete your partition table and repartition the SSD with a modern operating system. A modern OS will most likely put partition starts on multiples of 1 megabyte, which satisfies the above condition as well. Then copy your content back to your SSD. By doing so you spare your SSD future wear, as you reduce the write amplification.
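The alignment condition above can be checked with a few lines of Python. This is only a sketch: the function name `partition_is_aligned` is made up for illustration, the 512-byte logical sector size is taken from the smartctl output in the question, and the 4096-byte boundary is the figure this answer assumes for the internal page size. Start LBA 63 is the classic cylinder-boundary start used by old partitioning tools, and 2048 corresponds to a 1 MiB start.

```python
SECTOR_SIZE = 512   # logical sector size reported by smartctl for this drive
ALIGNMENT = 4096    # assumed internal page size, as used in this answer


def partition_is_aligned(start_lba: int,
                         sector_size: int = SECTOR_SIZE,
                         alignment: int = ALIGNMENT) -> bool:
    """Return True if the partition's starting byte offset is a multiple of `alignment`."""
    return (start_lba * sector_size) % alignment == 0


if __name__ == "__main__":
    # 63 = legacy cylinder-boundary start, 2048 = 1 MiB start
    for lba in (63, 2048):
        state = "aligned" if partition_is_aligned(lba) else "NOT aligned"
        print(f"start LBA {lba}: {state}")
```

The starting LBA of each partition has to be read from your partitioning tool first; the script only does the arithmetic.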

You can use Testdisk to write a log file containing your current partition scheme. I guess there are suitable parameters for lsblk or fdisk as well.

Toby Speight
  • 5,213
r2d3
  • 4,050
6

I'm not sure that gsmartcontrol really understands and reports correctly all the SMART attributes, or that the disk firmware is correctly reporting its SMART attributes.

The SMART attributes show some errors and a weak disk, but not a catastrophic state, and yet the self-test fails and you report some file corruption.

Most puzzling is the "Reallocated Sector Count" whose raw count is 36864, which is catastrophic, but its normalized value is quite good at 99, which is only slightly below the best value of 100.

Unless you like living dangerously, I would in your place replace this disk.


I see you have added Crucial's own SMART report, which is much clearer than that of gsmartcontrol.

These are the danger signs:

364544 Retired NAND Blocks
20     NAND Page Program Failures
64     NAND Block Erase Failures
582    ECC Correction Failures
353    Corrected ECC

The worst figure here is the number of Retired NAND Blocks, as defined by Crucial's SSDs and SMART Data:

Attribute 5: Retired NAND Blocks

The number of blocks retired through this process of continually evaluating the quality of NAND blocks is tracked in SMART Attribute 5. SSD firmware will retire NAND blocks for several reasons in addition to the wear and data retention issue described above. One reason for retirement is a failure to erase a block while deleting data or moving data during garbage collection. This type of failure causes a low risk to user data since the data in question is being deleted or has already been copied successfully to a new location on the SSD.

This means that 364544 blocks on the disk have become unusable with old age! This is enormous.
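As a side note on how this figure relates to the `36864 (0 5)` that smartctl prints for attribute 5: the two numbers are consistent if smartctl is showing the raw value as three 16-bit words, with the parenthesised pair being the higher-order words, most significant first. That reading is an assumption, not something the outputs state, and the helper name `combine_raw16_words` is invented for this sketch.

```python
def combine_raw16_words(low: int, mid: int = 0, high: int = 0) -> int:
    """Combine three 16-bit words into one integer (low word is least significant)."""
    for word in (low, mid, high):
        if not 0 <= word <= 0xFFFF:
            raise ValueError("each word must fit in 16 bits")
    return (high << 32) | (mid << 16) | low


if __name__ == "__main__":
    # smartctl shows "36864 (0 5)": assumed low=36864, high=0, mid=5
    print(combine_raw16_words(36864, mid=5, high=0))
```

Under that assumption, 5 × 65536 + 36864 = 364544, which matches the Retired NAND Blocks count in Crucial's report, so both tools would be reading the same counter.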

Final prognosis: The disk is failing and approaching its end of life. You should replace it as soon as possible.

harrymc
  • 498,455
4

In the SMART output, "pass" or "fail" is simply a one-line summary of the "SMART attributes" table. If none of the numbers in the "Value" or "Worst" columns is below the corresponding value in the "Threshold" column, it reports a "pass".
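A minimal sketch of that summary rule, using a few rows copied from the attribute table in the question. The `Attr` type and the function names are hypothetical, and the comparison follows the description above literally ("below" the threshold); real tools may differ in detail, for example by also treating a value equal to the threshold as failing.

```python
from typing import List, NamedTuple


class Attr(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int


# A subset of the attribute rows from the smartctl output in the question.
ATTRS = [
    Attr(5, "Reallocated_Sector_Ct", 99, 99, 10),
    Attr(170, "Grown_Failing_Block_Ct", 99, 99, 10),
    Attr(173, "Wear_Leveling_Count", 83, 83, 10),
    Attr(187, "Reported_Uncorrect", 100, 100, 1),
    Attr(202, "Perc_Rated_Life_Used", 83, 83, 1),
]


def failing_attributes(attrs: List[Attr]) -> List[Attr]:
    """Attributes whose normalized Value or Worst is below the Threshold."""
    return [a for a in attrs if a.value < a.thresh or a.worst < a.thresh]


def overall_health(attrs: List[Attr]) -> str:
    """One-line summary in the spirit of the SMART 'pass'/'fail' result."""
    return "FAILED" if failing_attributes(attrs) else "PASSED"


if __name__ == "__main__":
    print(overall_health(ATTRS))
```

Note that the raw values (36864, 582 and so on) never enter this check, which is why a drive with alarming raw counters can still be summarized as a pass.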

If a self-test is failing for any reason other than the test being canceled, the drive should be replaced, even if the SMART summary still says things are fine. With hard drives, SMART could only predict about half of all failures (and the SMART summary predicted even fewer). I don't think anybody's done a large-scale study of SMART on SSDs.

Mark
  • 1,608