Unable to identify SMART errors/issues of my NVMe disk

Question

I'm getting regular emails from the smart daemon about my NVMe disk.

SMART error (ErrorCount) detected on host: desk

This message was generated by the smartd daemon running on:
host name:  [redacted]
   DNS domain: [redacted]
The following warning/error was logged by the smartd daemon:
Device: /dev/nvme0, number of Error Log entries increased from 2519 to 2521
Device info:
KBG30ZMV256G TOSHIBA, S/N:X8OPD1PGP12P, FW:ADHA0101
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Oct  7 23:38:04 2023 EDT
Another message will be sent in 24 hours if the problem persists.

I've been trying to figure this out for months but I've not had any luck. Here are the various commands I have tried and their output.

`smartctl -a /dev/nvme0`

$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number:                       KBG30ZMV256G TOSHIBA
Serial Number:                      X8OPD1PGP12P
Firmware Version:                   ADHA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
NVMe Version:                       1.2.1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00080d 04004ad9aa
Local Time is:                      Sun Oct 15 17:53:35 2023 EDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.30W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.30W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    4  4  4  4     8000   32000
 4 -   0.0050W       -        -    4  4  4  4     8000   40000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         3
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    33%
Data Units Read:                    35,454,740 [18.1 TB]
Data Units Written:                 70,575,255 [36.1 TB]
Host Read Commands:                 306,457,518
Host Write Commands:                881,616,851
Controller Busy Time:               12,766
Power Cycles:                       342
Power On Hours:                     21,991
Unsafe Shutdowns:                   617
Media and Data Integrity Errors:    0
Error Information Log Entries:      2,528
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -
  1       2527     0  0x201d  0xc004  0x028            -     1     -
  2       2526     0  0x101d  0xc004  0x028            -     1     -
  3       2525     0  0x6005  0xc002  0x000            -     4     -
  4       2524     0  0x6004  0xc004  0x028            -     1     -
  5       2523     0  0x5006  0xc004  0x028            -     1     -
  6       2522     0  0x1006  0xc005  0x028            -     1     -
  7       2521     0  0x4013  0xc005  0x028            -     0     -

`nvme error-log /dev/nvme0`

nvme.log

`nvme list`

$ sudo ./nvme-cli-latest-x86_64.AppImage list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            X8OPD1PGP12P         KBG30ZMV256G TOSHIBA                     0x1        256.06  GB / 256.06  GB    512   B +  0 B   ADHA0101

score 7 · Accepted Answer · 2023-10-26T14:31:20.050

Based on an answer from Birkelund, here.

If you are asking how that error code is encoded in 0xC502: shift it right by one bit (0xC502 >> 1) to get rid of the Phase Tag. That leaves us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bits (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are “Media and Data Integrity Errors” and the 0x81 status code is “Unrecovered Read Error”.
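As a rough illustration of the shift-and-mask steps quoted above, here is a minimal Python sketch. The function name `decode_status` and the dictionary keys are my own, not from any tool. The positions of the "More" and "Do Not Retry" bits (14 and 15) are how I read the NVMe specification's status field layout, so treat them as an assumption.

```python
def decode_status(status: int) -> dict:
    """Split the 16-bit Status value printed by smartctl into bit fields."""
    return {
        "phase_tag": status & 0x1,      # bit 0, dropped by the ">> 1" step
        "sc": (status >> 1) & 0xFF,     # bits 8:1, Status Code
        "sct": (status >> 9) & 0x7,     # bits 11:9, Status Code Type
        "more": (status >> 14) & 0x1,   # bit 14 (assumed position)
        "dnr": (status >> 15) & 0x1,    # bit 15, Do Not Retry (assumed position)
    }


if __name__ == "__main__":
    status = 0xC502
    # Same arithmetic as in the text: drop the phase tag, keep the low 11 bits.
    combined = (status >> 1) & 0x7FF
    print(hex(combined))            # 0x281
    print(decode_status(status))
```

Splitting the 11-bit value into `sct` and `sc` makes it easier to look each part up separately in the code tables.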

We can apply the same logic to your errors.

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -

Status 0xc002

Get rid of the phase tag (shift right by one bit, the same as an integer divide by 2): 0x6001.
Applying the mask 0x7ff (which keeps the low 11 bits) gives 0x001.
0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
0x01 gives us NVME_STATUS_INVALID_COMMAND_OPCODE

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  1       2527     0  0x201d  0xc004  0x028            -     1     -

Status 0xc004

get rid of phase tag: 0x6002.
Apply mask 0x7ff gives 0x002.
0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
0x02 gives us NVME_STATUS_INVALID_FIELD_IN_COMMAND

etc.

In general, errors of this type are caused by something sending invalid or unsupported commands to the NVMe SSD, and as such they are not something to worry about.

Lookup codes:

NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,
// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND
NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,
// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC
NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,
// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR
NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,
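
To show how the lookup codes can be applied to the Status column in the question, here is a small Python sketch. It copies only a handful of entries from the list above into dictionaries. The names `SCT_NAMES`, `SC_NAMES` and `describe` are hypothetical helpers, not part of smartmontools or nvme-cli.

```python
# Status Code Type names, taken from the lookup list.
SCT_NAMES = {
    0: "NVME_STATUS_TYPE_GENERIC_COMMAND",
    1: "NVME_STATUS_TYPE_COMMAND_SPECIFIC",
    2: "NVME_STATUS_TYPE_MEDIA_ERROR",
    7: "NVME_STATUS_TYPE_VENDOR_SPECIFIC",
}

# A few (SCT, SC) pairs from the lookup list; extend as needed.
SC_NAMES = {
    (0, 0x00): "NVME_STATUS_SUCCESS_COMPLETION",
    (0, 0x01): "NVME_STATUS_INVALID_COMMAND_OPCODE",
    (0, 0x02): "NVME_STATUS_INVALID_FIELD_IN_COMMAND",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}


def describe(status: int) -> str:
    """Return 'type / code' names for a smartctl Status value."""
    sct = (status >> 9) & 0x7    # Status Code Type, after the phase tag and SC
    sc = (status >> 1) & 0xFF    # Status Code, after dropping the phase tag
    type_name = SCT_NAMES.get(sct, f"unknown type {sct}")
    code_name = SC_NAMES.get((sct, sc), f"unknown code {sc:#04x}")
    return f"{type_name} / {code_name}"


if __name__ == "__main__":
    # The three distinct Status values in the question's error log.
    for value in (0xC002, 0xC004, 0xC005):
        print(f"{value:#06x}: {describe(value)}")
```

Note that 0xc004 and 0xc005 differ only in the phase tag bit, so both decode to the same status code.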

score 0 · Answer 2 · answered Dec 17 '23 at 23:36

In addition to @joep-van-steen's answer: if you can install the nvme-cli package, which is packaged by all major distros, its `nvme` command will decode the status for you:

Show the entire error log (with decoded descriptions) -- must run with sudo of course:

# nvme error-log /dev/nvme0

Or, retrieve only the lines with "errors":

# nvme error-log /dev/nvme0 | grep -Ei 'status_field\s+:\s+0[^(]'

The output from the last command would be:

status_field    : 0x4002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
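
If you would rather post-process saved output than pipe it through grep, the same filter can be sketched in Python. This is only an illustration: `error_lines` is a hypothetical helper, and the "success" line in the sample is my assumption about what a non-error entry looks like, not captured output.

```python
import re

# Keep lines whose status_field starts with 0 followed by anything other
# than "(", which excludes a plain "0(" success entry.
ERROR_LINE = re.compile(r"status_field\s+:\s+0[^(]", re.IGNORECASE)


def error_lines(text: str) -> list[str]:
    """Return only the status_field lines that report a non-zero status."""
    return [line for line in text.splitlines() if ERROR_LINE.search(line)]


if __name__ == "__main__":
    # First line: assumed format of a successful entry (illustrative only).
    # Second line: the example output shown above.
    sample = (
        "status_field    : 0(SUCCESS: The command completed successfully)\n"
        "status_field    : 0x4002(Invalid Field in Command: A reserved coded "
        "value or an unsupported value in a defined field)\n"
    )
    for line in error_lines(sample):
        print(line)
```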
