Here's a part of my smartctl -H /dev/sda output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   053   028   045    Old_age   Always   In_the_past 47

How do I interpret this? Specifically:

  • What do the flag bits mean?
  • What's the difference between the 'value' and the 'raw value'?
  • What units is each numeric column using? Is it Celsius? In that case, why is 28 centigrade the worst I've had, if now I have 53? Or 47?
  • Is the threshold the value over which the drive is considered to fail? The value over which the drive shuts itself down? Something else?
fixer1234
einpoklum

2 Answers

First of all, the S.M.A.R.T. specification is pretty much non-existent:

"S.M.A.R.T." came to be understood (though without any formal specification) to refer to a variety of specific metrics and methods and to apply to protocols unrelated to ATA for communicating the same kinds of things.

It's a "standard" that is missing clear guidelines, but follows some common concepts. It was developed by different companies for different types of storage. The interpretation is vendor specific (and disk type specific). There are a few common attributes that usually mean the same thing on modern disks, e.g. Power_On_Hours (although there are some exceptions):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   046   046   000    -    48089

In order to understand the smartctl flags, you'd have to convert the number to binary and apply a bit mask. Fortunately there's an easier way: use the -x switch (smartctl -x /dev/sda), which gives you quite verbose output with the flags already decoded:

...
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    49936
 12 Power_Cycle_Count       -O--CK   100   100   000    -    12
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   092   092   000    -    127
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    8
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   055   048   000    -    45 (Min/Max 22/52)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   092   092   001    -    8
206 Write_Error_Rate        -OSR--   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    29233249069
247 Host_Program_Page_Count -O--CK   100   100   000    -    918009728
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    630581294
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    1234
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Each attribute may carry several (binary) flags. You might be interested in attributes with the P flag (prefailure warnings). Error-rate attributes (R flag) can also be a sign of upcoming problems.
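If you only have the hexadecimal FLAG column (as in the question), the bit mask can be applied by hand. Here is a minimal sketch, assuming the bits are assigned in the order of the legend above, with P as the lowest bit (0x01) and K as 0x20. The helper name decode_flags is made up for this example; it is not part of smartctl.

```python
# Assumed bit layout: the legend order above, lowest bit first.
FLAG_BITS = [
    (0x01, "P"),  # prefailure warning
    (0x02, "O"),  # updated online
    (0x04, "S"),  # speed/performance
    (0x08, "R"),  # error rate
    (0x10, "C"),  # event count
    (0x20, "K"),  # auto-keep
]


def decode_flags(flag: int) -> str:
    """Render a numeric FLAG value in the six-letter form printed by -x."""
    return "".join(letter if flag & mask else "-" for mask, letter in FLAG_BITS)


print(decode_flags(0x0022))  # -O---K
```

Under that assumption the 0x0022 from the question decodes to -O---K (updated online, auto-keep), the same flags shown for Temperature_Celsius in the listing above.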

The Raw value is the most reliable one. There was a noble idea to normalize all attributes to a common range of 0 to 100, so that a disk reporting 100 everywhere would be healthy. However, different vendors encode various things into these fields. E.g. for the attribute Temperature_Celsius the raw value is:

45 (Min/Max 22/52)

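The corresponding normalized value is 55 (which is weird). Nonetheless, printed as a plain integer the same raw field would read 223339741229 (which is definitely outside the Celsius scale), because the minimum and maximum are packed into it alongside the current temperature.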
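To see where that large integer comes from, here is a small sketch that splits it into 16-bit words. The layout (current temperature in the lowest word, then minimum, then maximum) is inferred from the numbers above and is vendor specific, so treat it as an assumption. The helper name unpack_words is hypothetical.

```python
def unpack_words(raw: int) -> list[int]:
    """Split a 48-bit raw value into three 16-bit words, lowest word first."""
    return [(raw >> shift) & 0xFFFF for shift in (0, 16, 32)]


# Assumed order for this particular disk: current, min, max.
current, minimum, maximum = unpack_words(223339741229)
print(current, minimum, maximum)  # 45 22 52

# Observation only, not a documented rule: on this disk VALUE = 100 - temperature.
print(100 - current)  # 55
```

The same "100 minus temperature" guess also fits the row in the question (VALUE 053 with raw 47). If it holds for that drive, WORST 028 would correspond to about 72 °C and THRESH 045 to about 55 °C, but nothing guarantees that a given vendor follows this convention.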

Temperature is typically reported in degrees Celsius; many vendors put the unit in the attribute name. According to Google's 2007 paper (on rotational disks), temperature isn't a good indicator of disk failure.

Tombart

What's the difference between the 'value' and the 'raw value'?

For example Attribute 12 is "power cycle count": how many times has the disk been powered up.

Each Attribute has a "Raw" value, printed under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE". [Note: smartctl prints these values in base-10.] In the example just given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly one year. Each vendor uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254. Please keep in mind that smartctl only reports the different Attribute types, values, and thresholds as read from the device. It does not carry out the conversion between "Raw" and "Normalized" values: this is done by the disk's firmware.
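The passage above does not spell out how the threshold is used. As far as I understand the smartctl documentation, an attribute counts as failed when its normalized value is less than or equal to the threshold, and the WHEN_FAILED column is derived from VALUE and WORST. Here is a rough sketch of that rule. The function when_failed is made up for illustration, and the special case for a zero threshold is an assumption based on attribute 180 in the other answer's listing (VALUE 000, THRESH 000, FAIL "-").

```python
def when_failed(value: int, worst: int, thresh: int) -> str:
    """Approximate the WHEN_FAILED column from the normalized columns."""
    if thresh == 0:
        # Assumption: a zero threshold means the attribute never fails.
        return "-"
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


# Row from the question: VALUE 053, WORST 028, THRESH 045.
print(when_failed(53, 28, 45))  # In_the_past
```

For the row in the question this gives In_the_past: the current value 53 is above the threshold 45, but the worst recorded value 28 fell below it at some point.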

What units is each numeric column using?

The conversion from Raw value to a quantity with physical units is not specified by the SMART standard. In most cases, the values printed by smartctl are sensible. For example the temperature Attribute generally has its raw value equal to the temperature in Celsius. However in some cases vendors use unusual conventions. For example the Hitachi disk on my laptop reports its power-on hours in minutes, not hours. Some IBM disks track three temperatures rather than one, in their raw values. And so on.

Source

Jan