
The Error Information Log Entries value shown by smartctl -a /dev/nvme0n1 for my NVMe drive is growing fast, by 1 per second. Does this indicate a faulty driver?

At the same time, Media and Data Integrity Errors is currently showing a value of 0.

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SKC3000D4096G
Serial Number:                      xxxxx
Firmware Version:                   EIFK31.6
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Total NVM Capacity:                 4,096,805,658,624 [4.09 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,096,805,658,624 [4.09 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 282b2ba6c5
Local Time is:                      Fri Mar 24 01:33:14 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.80W       -        -    0  0  0  0        0       0
 1 +     7.10W       -        -    1  1  1  1        0       0
 2 +     5.20W       -        -    2  2  2  2        0       0
 3 -   0.0620W       -        -    3  3  3  3     2500    7500
 4 -   0.0620W       -        -    4  4  4  4     2500    7500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        55 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    213,006,510 [109 TB]
Data Units Written:                 549,370,112 [281 TB]
Host Read Commands:                 11,210,192,197
Host Write Commands:                20,687,602,229
Controller Busy Time:               14,055
Power Cycles:                       39
Power On Hours:                     4,204
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      1,479,242
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               75 Celsius
Thermal Temp. 1 Total Time:         58745

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0    1479242     0  0x2015  0x4004  0x102c    0     0   -
  1    1479241     0  0x2014  0x4004  0x102c    0     0   -
  2    1479240     0  0xd010  0x4004  0x102c    0     0   -
  3    1479239     0  0xc013  0x4004  0x102c    0     0   -
  4    1479238     0  0xb011  0x4004  0x102c    0     0   -
  5    1479237     0  0x8009  0x4004  0x102c    0     0   -
  6    1479236     0  0x0015  0x4004  0x102c    0     0   -
  7    1479235     0  0x0014  0x4004  0x102c    0     0   -
  8    1479234     0  0xa011  0x4004  0x102c    0     0   -
  9    1479233     0  0xa010  0x4004  0x102c    0     0   -
 10    1479232     0  0x9012  0x4004  0x102c    0     0   -
 11    1479231     0  0x9011  0x4004  0x102c    0     0   -
 12    1479230     0  0x6000  0x4004  0x102c    0     0   -
 13    1479229     0  0x5003  0x4004  0x102c    0     0   -
 14    1479228     0  0x4001  0x4004  0x102c    0     0   -
 15    1479227     0  0x4000  0x4004  0x102c    0     0   -
... (47 entries not read)

I uploaded the output of nvme error-log /dev/nvme0n1 too: https://pastebin.com/SQJM7KhV

Gotenks
  • 211

1 Answer


In my case, it was caused by Node Exporter (Prometheus).

After stopping the process, the Error Information Log Entries counter stopped increasing. Node Exporter is probably issuing queries that the NVMe driver does not support (I will have to dig deeper).
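
To check whether stopping a process really halts the counter, the growth rate can be measured by sampling smartctl twice. This is only a minimal sketch: the function names and the 10-second interval are my own choices, and it assumes the counter line looks like the one in the question's output.

```python
import re
import subprocess
import time

# Matches e.g. "Error Information Log Entries:      1,479,242"
_COUNTER_RE = re.compile(r"Error Information Log Entries:\s*([\d,]+)")


def parse_error_log_entries(smartctl_output: str) -> int:
    """Extract the counter from `smartctl -a` text output."""
    match = _COUNTER_RE.search(smartctl_output)
    if match is None:
        raise ValueError("counter not found in smartctl output")
    return int(match.group(1).replace(",", ""))


def entries_per_second(first: int, second: int, interval_s: float) -> float:
    """Average growth of the counter between two samples."""
    return (second - first) / interval_s


def measure(device: str = "/dev/nvme0n1", interval_s: float = 10.0) -> float:
    """Sample the counter twice and return its growth rate (hypothetical helper)."""

    def sample() -> int:
        # check=False: smartctl can exit non-zero even when it prints valid data.
        result = subprocess.run(
            ["smartctl", "-a", device],
            capture_output=True, text=True, check=False,
        )
        return parse_error_log_entries(result.stdout)

    before = sample()
    time.sleep(interval_s)
    after = sample()
    return entries_per_second(before, after, interval_s)
```

Running `measure()` as root once with the suspected process active and once with it stopped should show whether the rate drops to zero.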

UPDATE: I edited the hwmon collector code to exclude the faulty sensor: https://github.com/prometheus/node_exporter/issues/2643
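
To narrow down which sensor is involved, one option is to read the hwmon temperature inputs one at a time outside of Node Exporter. The sketch below assumes the standard Linux sysfs layout (`/sys/class/hwmon/hwmon*/name` and `temp*_input` in millidegrees Celsius); the function name is hypothetical, and it does not reproduce the change from the linked issue.

```python
from pathlib import Path


def read_temp_inputs(hwmon_root: str = "/sys/class/hwmon"):
    """Read every temp*_input under each hwmon chip.

    Returns a list of (chip_name, file_name, celsius_or_None, error_or_None).
    """
    results = []
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.is_file() else "unknown"
        for sensor in sorted(chip_dir.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
                results.append((chip, sensor.name, millideg / 1000.0, None))
            except (OSError, ValueError) as exc:
                # A sensor the device cannot serve may surface as a read error.
                results.append((chip, sensor.name, None, str(exc)))
    return results
```

Reading the sensors of the `nvme` chip individually, and checking the Error Information Log Entries value in smartctl -a before and after each read, may show which one triggers a new log entry.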

Gotenks
  • 211