NVMe faulty drive? SMART Error Information Log Entries rapidly increasing

Question

The Error Information Log Entries value shown by smartctl -a /dev/nvme0n1 for my NVMe drive is growing fast, by about 1 per second. Does this indicate a faulty drive?

At the same time, Media and Data Integrity Errors is currently showing a value of 0.

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SKC3000D4096G
Serial Number:                      xxxxx
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282b2ba6c5
Local Time is:                      Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    213,006,510 [109 TB]
Data Units Written:                 549,370,112 [281 TB]
Host Read Commands:                 11,210,192,197
Host Write Commands:                20,687,602,229
Controller Busy Time:               14,055
Power Cycles:                       39
Power On Hours:                     4,204
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               75 Celsius
Thermal Temp. 1 Total Time:         58745
Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0    1479242     0  0x2015  0x4004 0x102c            0     0     -
  1    1479241     0  0x2014  0x4004 0x102c            0     0     -
  2    1479240     0  0xd010  0x4004 0x102c            0     0     -
  3    1479239     0  0xc013  0x4004 0x102c            0     0     -
  4    1479238     0  0xb011  0x4004 0x102c            0     0     -
  5    1479237     0  0x8009  0x4004 0x102c            0     0     -
  6    1479236     0  0x0015  0x4004 0x102c            0     0     -
  7    1479235     0  0x0014  0x4004 0x102c            0     0     -
  8    1479234     0  0xa011  0x4004 0x102c            0     0     -
  9    1479233     0  0xa010  0x4004 0x102c            0     0     -
 10    1479232     0  0x9012  0x4004 0x102c            0     0     -
 11    1479231     0  0x9011  0x4004 0x102c            0     0     -
 12    1479230     0  0x6000  0x4004 0x102c            0     0     -
 13    1479229     0  0x5003  0x4004 0x102c            0     0     -
 14    1479228     0  0x4001  0x4004 0x102c            0     0     -
 15    1479227     0  0x4000  0x4004 0x102c            0     0     -
... (47 entries not read)

I uploaded the output of nvme error-log /dev/nvme0n1 too: https://pastebin.com/SQJM7KhV

Gotenks · Accepted Answer · 2023-03-26T13:53:26.387

4

In my case, it was caused by Node Exporter (Prometheus).

After I stopped the process, the Error Information Log Entries counter stopped increasing. It is probably making queries that are not supported by the NVMe driver (I will have to dig deeper).

UPDATE: I edited the hwmon collector code to exclude the faulty sensor: https://github.com/prometheus/node_exporter/issues/2643
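
To check whether a given process is responsible, it can help to measure how fast the counter grows before and after stopping it. Below is a minimal sketch, assuming smartctl is installed, the script runs as root, and the counter line looks like the one in the output above. The function names (parse_error_log_entries, read_counter, growth_per_second) are hypothetical and only for illustration.

```python
import re
import subprocess
import time

# Matches the counter line as printed in the SMART/Health section above,
# e.g. "Error Information Log Entries:      1,479,242".
_COUNTER_RE = re.compile(
    r"^Error Information Log Entries:\s+([\d,]+)", re.MULTILINE
)


def parse_error_log_entries(smartctl_output: str) -> int:
    """Extract the 'Error Information Log Entries' counter from smartctl -a text."""
    match = _COUNTER_RE.search(smartctl_output)
    if match is None:
        raise ValueError("counter line not found in smartctl output")
    # smartctl prints thousands separators; strip them before converting.
    return int(match.group(1).replace(",", ""))


def read_counter(device: str) -> int:
    """Run 'smartctl -a <device>' and return the current counter value."""
    # check=False: smartctl may exit non-zero while still printing a full report.
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_error_log_entries(result.stdout)


def growth_per_second(device: str, interval: float = 10.0) -> float:
    """Sample the counter twice, 'interval' seconds apart, and return the rate."""
    before = read_counter(device)
    time.sleep(interval)
    after = read_counter(device)
    return (after - before) / interval
```

Calling growth_per_second("/dev/nvme0n1") once with Node Exporter running and once with it stopped should show the rate dropping to roughly 0 if the exporter is the cause, as it was in my case.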

edited Mar 26 '23 at 13:53

answered Mar 26 '23 at 00:40

