
I'm getting regular emails from the smart daemon about my NVMe disk.

SMART error (ErrorCount) detected on host: desk

This message was generated by the smartd daemon running on:

host name: [redacted]
DNS domain: [redacted]

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 2519 to 2521

Device info: KBG30ZMV256G TOSHIBA, S/N:X8OPD1PGP12P, FW:ADHA0101

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Oct 7 23:38:04 2023 EDT
Another message will be sent in 24 hours if the problem persists.

I've been trying to figure this out for months but I've not had any luck. Here are the various commands I have tried and their output.

smartctl -a /dev/nvme0

$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-13-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KBG30ZMV256G TOSHIBA
Serial Number:                      X8OPD1PGP12P
Firmware Version:                   ADHA0101
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
NVMe Version:                       1.2.1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00080d 04004ad9aa
Local Time is:                      Sun Oct 15 17:53:35 2023 EDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0017):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning Comp. Temp. Threshold:      82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.30W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.30W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    4  4  4  4     8000   32000
 4 -   0.0050W       -        -    4  4  4  4     8000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096       0         0
 1 +     512       0         3

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    33%
Data Units Read:                    35,454,740 [18.1 TB]
Data Units Written:                 70,575,255 [36.1 TB]
Host Read Commands:                 306,457,518
Host Write Commands:                881,616,851
Controller Busy Time:               12,766
Power Cycles:                       342
Power On Hours:                     21,991
Unsafe Shutdowns:                   617
Media and Data Integrity Errors:    0
Error Information Log Entries:      2,528
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -
  1       2527     0  0x201d  0xc004  0x028            -     1     -
  2       2526     0  0x101d  0xc004  0x028            -     1     -
  3       2525     0  0x6005  0xc002  0x000            -     4     -
  4       2524     0  0x6004  0xc004  0x028            -     1     -
  5       2523     0  0x5006  0xc004  0x028            -     1     -
  6       2522     0  0x1006  0xc005  0x028            -     1     -
  7       2521     0  0x4013  0xc005  0x028            -     0     -

nvme error-log /dev/nvme0

nvme.log

nvme list

$ sudo ./nvme-cli-latest-x86_64.AppImage list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            X8OPD1PGP12P         KBG30ZMV256G TOSHIBA                     0x1        256.06  GB / 256.06  GB    512   B +  0 B   ADHA0101

2 Answers

7

Based on an answer from Birkelund:

If you are asking how that error code is encoded in 0xC502, then it's 0xC502 >> 1 to get rid of the Phase Tag. That leaves us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bits (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are “Media and Data Integrity Errors” and the 0x81 status code is “Unrecovered Read Error”.
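The shift-and-mask steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `decode_status` and the tuple it returns are made up for this example and are not part of smartctl or nvme-cli.

```python
def decode_status(status):
    """Split a 16-bit NVMe status value into (phase_tag, sct, sc)."""
    phase_tag = status & 0x1       # bit 0 is the Phase Tag
    code = (status >> 1) & 0x7FF   # drop the Phase Tag, keep the low 11 bits
    sct = code >> 8                # upper 3 bits: Status Code Type
    sc = code & 0xFF               # lower 8 bits: Status Code
    return phase_tag, sct, sc


phase_tag, sct, sc = decode_status(0xC502)
print(phase_tag, hex(sct), hex(sc))  # 0 0x2 0x81
```

Here type 0x2 is the media error group and 0x81 is the unrecovered read error, matching the quoted explanation.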

We can apply the same logic to your errors.

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       2528     0  0x301c  0xc002  0x000            -     4     -

Status 0xc002

  • get rid of the phase tag by shifting right one bit (same as integer division by 2): 0x6001.
  • Apply mask 0x7ff (keeps the low 11 bits, i.e. the last three hex digits with the topmost bit cleared) gives 0x001.
  • 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
  • 0x01 gives us NVME_STATUS_INVALID_COMMAND_OPCODE

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  1       2527     0  0x201d  0xc004  0x028            -     1     -

Status 0xc004

  • get rid of phase tag: 0x6002.
  • Apply mask 0x7ff gives 0x002.
  • 0x0xx gives us NVME_STATUS_TYPE_GENERIC_COMMAND
  • 0x02 gives us NVME_STATUS_INVALID_FIELD_IN_COMMAND

etc. The remaining entries decode the same way.
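As a quick check, the same arithmetic can be run over the three distinct Status values that appear in the smartctl error log from the question (0xc002, 0xc004, 0xc005). This is a sketch; the helper name `split_status` and the `decoded` dictionary are illustrative only.

```python
def split_status(status):
    """Return (phase_tag, status_code_type, status_code) for a status value."""
    code = (status >> 1) & 0x7FF   # drop the phase tag, keep 11 bits
    return status & 0x1, code >> 8, code & 0xFF


# Distinct Status values taken from the error log in the question.
decoded = {hex(s): split_status(s) for s in (0xC002, 0xC004, 0xC005)}

for raw, (phase_tag, sct, sc) in decoded.items():
    print(f"{raw}: phase_tag={phase_tag} sct={sct:#x} sc={sc:#04x}")
```

The 0xc005 entries differ from 0xc004 only in the phase tag bit, so they come out with the same type 0x0 and code 0x02 (invalid field in command).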

In general, errors of this type are caused by something sending invalid or unsupported commands to the NVMe SSD, and as such they are not something to worry about.

Lookup codes:

NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,

// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND

NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,

// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC

NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,

// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR

NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,
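If you would rather not look the codes up by hand, a small lookup table does the job. The sketch below copies only a handful of entries from the lists above (extend it as needed); the names `STATUS_TYPES`, `STATUS_CODES` and `describe` are invented for this example, and anything missing from the tables is reported as unlisted rather than guessed.

```python
# Status Code Type values, copied from the lookup list above.
STATUS_TYPES = {
    0: "NVME_STATUS_TYPE_GENERIC_COMMAND",
    1: "NVME_STATUS_TYPE_COMMAND_SPECIFIC",
    2: "NVME_STATUS_TYPE_MEDIA_ERROR",
    7: "NVME_STATUS_TYPE_VENDOR_SPECIFIC",
}

# A small subset of (type, code) pairs from the lists above.
STATUS_CODES = {
    (0, 0x00): "NVME_STATUS_SUCCESS_COMPLETION",
    (0, 0x01): "NVME_STATUS_INVALID_COMMAND_OPCODE",
    (0, 0x02): "NVME_STATUS_INVALID_FIELD_IN_COMMAND",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}


def describe(status):
    """Decode a status value as shown by smartctl into two symbolic names."""
    code = (status >> 1) & 0x7FF   # drop the phase tag, keep 11 bits
    sct, sc = code >> 8, code & 0xFF
    type_name = STATUS_TYPES.get(sct, f"unlisted type {sct:#x}")
    code_name = STATUS_CODES.get((sct, sc), f"unlisted code {sc:#04x}")
    return type_name, code_name


print(describe(0xC002))
print(describe(0xC004))
```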

0

In addition to @joep-van-steen's answer: if you can install the nvme-cli package, which most major distros provide, its error-log command will decode the status for you:

Show the entire error log (with decoded descriptions) -- must run with sudo of course:

# nvme error-log /dev/nvme0

Or, retrieve only the lines with "errors":

# nvme error-log /dev/nvme0 | grep -Ei 'status_field\s+:\s+0[^(]'

The output from the last command would be:

status_field    : 0x4002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
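To see what the filter keeps and what it drops, here is the same pattern in Python. The first sample line is the output quoted above; the second is a made-up line (not real nvme-cli output) in the shape the pattern is written to exclude, i.e. a status of plain 0 followed directly by an opening parenthesis.

```python
import re

# Same pattern as the grep filter: "status_field : 0" not followed by "(".
pattern = re.compile(r"status_field\s+:\s+0[^(]", re.IGNORECASE)

sample_lines = [
    # Taken from the output quoted above.
    "status_field    : 0x4002(Invalid Field in Command: A reserved coded value "
    "or an unsupported value in a defined field)",
    # Hypothetical placeholder line for an entry with status 0.
    "status_field    : 0(placeholder description)",
]

error_lines = [line for line in sample_lines if pattern.search(line)]
for line in error_lines:
    print(line)
```

Only the 0x4002 line is printed.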