
I'm experiencing a puzzling discrepancy in ECC error reporting between Linux and Windows on my setup. Hoping the community can help me understand why.

Hardware

  • CPU: AMD Ryzen 7 7700
  • Motherboard: ASUS TUF Gaming B650 Plus
  • RAM: 4x32GB Samsung M324R4GA3BB0-CQK ECC UDIMMs (JEDEC 4800MT/s native speed)

Configuration

  • BIOS: Version 3222 (AGESA ComboAM5PI 1.2.0.3a Patch A), latest available.
  • Memory Settings:
    • Speed: Manually set to 4800MT/s (no timings/voltage adjustments).
    • ECC: Enabled in BIOS.

Observations

  • Linux (SystemRescue, kernel 6.12.19):
    • Tools: mprime (Prime95) and memtester.
    • Results: several Corrected ECC errors every hour (logged via dmesg).
    • System remains stable (no crashes/uncorrected errors).
  • Windows 11 (fully updated):
    • Tools: Prime95 and TM5 (TestMem5).
    • Results: WHEA-Logger entries with Event ID 47 (filtered via Event Viewer) are very rare, often none for hours. Each one reads "Component: Memory, Error Source: Corrected Machine Check".
    • System also stable (no crashes/BSODs).

The Mystery
Why does Linux log frequent corrected ECC errors under load, while Windows 11 shows almost none?
Does Windows 11 genuinely experience fewer errors, or is it simply not logging them (e.g., WHEA filters "corrected" errors)?
How can I test whether the errors are occurring silently in Windows?

Linux error example

[  313.863632] mce: [Hardware Error]: Machine check events logged
[  313.863644] [Hardware Error]: Corrected error, no action required.
[  313.865010] [Hardware Error]: CPU:0 (19:61:2) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
[  313.865970] [Hardware Error]: Error Addr: 0x0000000e52d461c0
[  313.866806] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00e880000a800301
[  313.868577] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[  313.868848] EDAC MC0: 1 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x1d25a8c offset:0x2c0 grain:64 syndrome:0x8000)
[  313.870600] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
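For readers who want to check the flag list in the MC21_STATUS line by hand, here is a minimal Python sketch that decodes the raw status word from the log above. The bit positions are the ones the Linux kernel uses for AMD MCA status registers as far as I know; the names `STATUS_FLAGS`, `decode_flags` and `extended_error_code` are made up for this example.

```python
# Bit positions of selected MCA_STATUS flags on AMD (assumed to match the
# Linux kernel's definitions; only a subset is listed here).
STATUS_FLAGS = {
    63: "Val",       # register holds a valid error
    62: "Over",      # overflow: another error arrived before this one was read
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled
    59: "MiscV",     # MISC register valid
    58: "AddrV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    53: "SyndV",     # SYND register valid
    46: "CECC",      # corrected by ECC
    45: "UECC",      # uncorrected ECC error
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}

def decode_flags(status: int) -> list[str]:
    """Return the names of the flags set in an MCA status word, high bit first."""
    return [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
            if (status >> bit) & 1]

def extended_error_code(status: int) -> int:
    """Extract the extended error code field (bits 21:16)."""
    return (status >> 16) & 0x3F

STATUS = 0xDC2040000400011B  # value copied from the dmesg line above

print(decode_flags(STATUS))
# ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
print(extended_error_code(STATUS))
# 0
```

UC is clear and CECC is set, which agrees with "Corrected error, no action required" and "Ext. Error Code: 0". The Over flag is also set, which usually means at least one further error hit the same bank before the register was read.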

Windows error example

A corrected hardware error has occurred.

Component: Memory Error Source: Corrected Machine Check

- <EventData>
  <Data Name="ErrorSource">1</Data> 
  <Data Name="FRUId">{00000000-0000-0000-0000-000000000000}</Data> 
  <Data Name="FRUText" /> 
  <Data Name="ValidBits">0x2</Data> 
  <Data Name="ErrorStatus">0x0</Data> 
  <Data Name="PhysicalAddress">0xe8ec5ad40</Data> 
  <Data Name="PhysicalAddressMask">0x0</Data> 
  <Data Name="Node">0x0</Data> 
  <Data Name="Card">0x0</Data> 
  <Data Name="Module">0x0</Data> 
  <Data Name="Bank">0x0</Data> 
  <Data Name="Device">0x0</Data> 
  <Data Name="Row">0x0</Data> 
  <Data Name="Column">0x0</Data> 
  <Data Name="BitPosition">0x0</Data> 
  <Data Name="RequesterId">0x0</Data> 
  <Data Name="ResponderId">0x0</Data> 
  <Data Name="TargetId">0x0</Data> 
  <Data Name="ErrorType">0</Data> 
  <Data Name="Extended">0</Data> 
  <Data Name="RankNumber">0</Data> 
  <Data Name="CardHandle">0</Data> 
  <Data Name="ModuleHandle">0</Data> 
  <Data Name="Length">1919</Data> 
  <Data Name="RawData">435045521002FFFFFFFF040002000000020000007F07000014090400150419140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F8905BFF240473B2DB01000000000000000000000000000000000000000000000000A00100005000000000030000010000001411BCA5646FDE4EB8633E83ED7C83B100000000000000000000000000000000020000000000000000000000000000000000000000000000F0010000C00000000003000000000000ADCC7698B447DB4BB65E16F193C4F3DB00000000000000000000000000000000020000000000000000000000000000000000000000000000B0020000A80400000003000000000000011D1E8AF94257459C33565E5CC3F7E80000000000000000000000000000000002000000000000000000000000000000000000000000000058070000270000000003000000000000A13248C3C302524CA9F19F1D5D7723FC000000000000000000000000000000000300000000000000000000000000000000000000000000000200000000000000000000000000000040ADC58E0E00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007F010000000000000002010000030000120FA6000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004000000020000002AE2882873B2DB010000000000000000000000000000000000000000150000001B010004004020D440ADC58E0E00000000000000000000000A00000000000000000F0500960000000103800A0080E800FD010000070000000000000000000000000000000000000000000001000010D00000000000001000000000000000100000000000000010001B00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000FF00000000000000000000000000000000000000000000000000</Data> 
  </EventData>
  </Event>

I assume they are the same hardware events.
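One way to test that assumption is to look for the machine-check registers inside the Windows RawData blob. Below is a minimal Python sketch that searches a short excerpt of the RawData above for the little-endian encoding of the reported PhysicalAddress and reads the eight bytes in front of it. That those eight bytes are the MCA status word is my guess about the record layout, not something taken from documentation, and `find_le64` is a helper invented for this example.

```python
import struct

# 16-byte excerpt copied from the RawData field above (not the whole blob).
RAW_EXCERPT = "1B010004004020D440ADC58E0E000000"
PHYSICAL_ADDRESS = 0xE8EC5AD40  # from <Data Name="PhysicalAddress">

def find_le64(raw_hex: str, value: int) -> int:
    """Return the byte offset of a little-endian 64-bit value, or -1 if absent."""
    return bytes.fromhex(raw_hex).find(struct.pack("<Q", value))

raw = bytes.fromhex(RAW_EXCERPT)
offset = find_le64(RAW_EXCERPT, PHYSICAL_ADDRESS)

# Assumption: the 8 bytes right before the address are the MCA status word.
(preceding,) = struct.unpack_from("<Q", raw, offset - 8)

print(offset)          # 8
print(hex(preceding))  # 0xd42040000400011b
```

If that reading is right, the Windows record carries almost the same status value as the Linux one (0xdc2040000400011b), differing only in bit 59 (MiscV). The same search applied to the full RawData string with the Linux Syndrome value would show whether that matches too.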

edo1

1 Answer


According to MemTest86 Technical Information, the x86 system architecture reports memory ECC errors through hardware registers that are to be polled.

Due to different memory controller architectures amongst different chipsets, there is no common ECC error framework; specific ECC polling code is required for each chipset. In particular, this would involve polling one or more of the following hardware registers:
...

Given the variety of how memory errors are detected and reported by the hardware, we can expect a similar variety in system software.

How do I know when ECC errors are detected?

The mechanism for how ECC errors are logged and reported to the end-user depends on the BIOS and operating system. In most cases, corrected ECC errors are written to system/event logs. Uncorrected ECC errors may result in kernel panic or blue screen.

How does MemTest86 report ECC errors?

MemTest86 directly polls ECC errors logged in the chipset/memory controller registers and displays it to the user on-screen.


Apparently a corrected ECC error does not raise an interrupt that the OS must handle. That avoids any potential system overload if a burst of memory errors were to occur.
Rather, corrected ECC errors leave information in hardware registers that the OS can gather when it chooses.

So if Windows 11 seems to report (corrected) ECC errors less frequently than Linux, that could indicate that it retrieves these hardware logs at a different rate or in a different manner than Linux.
Whether corrected errors are inconsequential (and can be ignored) or a predictor of a possible HW failure (and salient) is probably debatable.

Your Linux distro apparently chooses to log all available information, but you also have some control of what messages are displayed on the system console (e.g. with the loglevel kernel parameter).
Perhaps Windows tries to be more user-friendly and less alarmist, and constrains its error reporting since the user really needs to take no (immediate) remedial action.
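On the Linux side you can at least see how often the kernel polls and how many corrected errors it has counted, without relying on console messages. A minimal Python sketch follows, assuming the EDAC driver is loaded and the usual sysfs files are present (`ce_count` under the EDAC memory controller directory, `check_interval` under machinecheck); the function names are made up for this example, and both readers return nothing rather than fail when a file is missing.

```python
from pathlib import Path

# Standard sysfs locations; whether they exist depends on kernel and drivers.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")
MCE_ROOT = Path("/sys/devices/system/machinecheck")

def read_ce_counts(edac_root=EDAC_ROOT) -> dict[str, int]:
    """Return the corrected-error counter of each EDAC memory controller."""
    counts = {}
    for counter in sorted(Path(edac_root).glob("mc*/ce_count")):
        counts[counter.parent.name] = int(counter.read_text().strip())
    return counts

def read_check_interval(mce_root=MCE_ROOT):
    """Return the machine-check polling interval in seconds, or None if absent."""
    interval = Path(mce_root) / "machinecheck0" / "check_interval"
    return int(interval.read_text().strip()) if interval.exists() else None

def errors_per_hour(before: int, after: int, elapsed_seconds: float) -> float:
    """Convert two counter readings taken elapsed_seconds apart into a rate."""
    return (after - before) * 3600.0 / elapsed_seconds

if __name__ == "__main__":
    print("corrected errors per controller:", read_ce_counts())
    print("polling interval (s):", read_check_interval())
```

Reading `ce_count` before and after a stress run and passing both values to `errors_per_hour` gives a figure to set against the number of WHEA-Logger events Windows records over a run of the same length.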


Something suggested by MemTest86 Technical Information:

If your system allows for it, try disabling Quick Boot in the BIOS, some error messages should disappear.

sawdust