Note: this question is a duplicate of a prior question that has received a pretty detailed answer.
In the summer of 2023, I published a long overdue summary of this topic at my employer's website.
=========
Dear fellow Superusers,
you may remember my inappropriately broad question a while ago... this is the sequel. This time I have enough data to be more specific. We caught one culprit in the field, and this time we were somewhat prepared. But the first results are perplexing to me. It's not EIST, it's not PROCHOT, it's not CLOCKMOD. After today I am wondering: "what does Windows know that I do not"? I must be missing some sweet secret - another MSR or clock steering mechanism...
The problem: on a couple dozen units of one PC model running Windows Server 2012 R2, "every now and then" at random, some random machine starts to "ooze like molasses". The CPU becomes subjectively very slow. Windows Task Manager in 2012 reports the CPU clock firmly at "0.22 GHz", which is a weird value. The problem vanishes upon a power cycle.
The effective average MTTF so far has been about 20 years, i.e. impossible to reproduce in a lab. The machines run cool, far away from thermal throttling thresholds. Verified by generating 100% CPU load for a few days, while recording the coretemp sensor. In production, the machines actually run nearly idle for months - verified by some relevant messages in the Event Log, saying that the CPU has been at the lowest EIST clock for another day - this for 100+ consecutive days, until a power-cycle. There are no local environmental factors or other circumstances to correlate this behavior to.
The CPU is a Haswell Core i7-4650U: dual-core with HT, 15W TDP, actually consuming maybe < 5W at idle. Nominally a "mobile" CPU - although the machine is not a notebook and does not have a battery. It does have an embedded controller.
The CPU's nominal clock frequency (the one printed on the tin) is 1.7 GHz. EIST max is 2.3 GHz. Turbo can boost up to 3.3 GHz single core, 2.9 GHz across all cores. The reference clock appears to be 100 MHz from the onboard clock synth (actually more like 98 MHz, apparently). Under normal circumstances, in Windows or Linux at a typical workload = idle, the CPU frequency just stays put at 800 MHz (790 MHz reported) i.e. multiplier = 8.
Now for those 0.22 GHz. This is a single value, averaged over all four cores. Also, it is not a physical clock rate - rather, it is inferred by Windows from some nominal clock rate and some hardware "performance gages", probably MSR registers, where the value is supposedly a percentage of maximum performance.
Effective clock = Nominal max clock * "frequency percentage gage" * "throttling percentage gage"
or, in the parlance of Windows Performance Counters:
Effective clock = Nominal max clock * "% Processor Performance"
That is then averaged across all four cores for the Task Manager GUI. The Windows performance counters (a software-level API/UI of the OS) make the counters available per core, aggregated per CPU package and total per system.
That formula has been working for me in the lab, i.e. on a healthy system. I haven't found a Windows performance counter holding the "EIST Max" = our factual "frame of reference" for the "% Processor Performance" but never mind... The closest I could get to the faulty state by fiddling with ClockMod on a healthy machine idling at 800 MHz was by throttling two cores to 4/16 and two to 5/16, via the IA32_CLOCK_MODULATION MSR. This was done using unclewebb's MSR Tool - the "effective CPU clock" reported by the Windows Task Manager kept flipping between 0.23 and 0.25 GHz.
Following several people's advice that this might have to do with the PROCHOT signal or on-demand throttling, I have cobbled together my own tools to access MSR_POWER_CTL, IA32_PACKAGE_THERM_STATUS, IA32_THERM_STATUS and IA32_CLOCK_MODULATION - to have something small, simple and to the point, when investigating the next culprit out in the wild.
And, that culprit has come today, and, ...I'm in a bit of a shock. No PROCHOT sources are active, apparently PROCHOT has never happened since the last power-up, and "on-demand throttling" (aka CLOCKMOD) is off as well.
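As a worked check of the lab experiment, here is a minimal sketch of the two-factor formula, assuming (as stated above) that EIST max = 2300 MHz is the reference clock, the cores idle at 800 MHz, and the CLOCKMOD duty cycles are 4/16 on two cores and 5/16 on the other two. The function and variable names are made up for illustration.

```python
# Effective clock = nominal max clock * frequency fraction * throttling duty cycle
EIST_MAX_MHZ = 2300  # reference clock assumed for the percentages
IDLE_MHZ = 800       # healthy machine idling at multiplier 8


def effective_mhz(max_mhz, core_mhz, duty_sixteenths):
    """Apply the formula to one core; the duty cycle is given in 1/16 steps."""
    freq_fraction = core_mhz / max_mhz
    duty = duty_sixteenths / 16
    return round(max_mhz * freq_fraction * duty, 1)


duties = [4, 4, 5, 5]  # two cores at 4/16, two cores at 5/16
per_core = [effective_mhz(EIST_MAX_MHZ, IDLE_MHZ, d) for d in duties]
average = sum(per_core) / len(per_core)

print(per_core)  # [200.0, 200.0, 250.0, 250.0]
print(average)   # 225.0
```

The mean of 225 MHz lands in the neighbourhood of the 0.23 to 0.25 GHz that Task Manager showed in that experiment.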
...the same for CPU cores 1,2 and 3.
We've tried disabling BD-PROCHOT - the disabled flag sticks in the MSR, but the problem does not go away. Which is not surprising. We've tried fiddling with the CLOCKMOD bits - using another tool, that does read + write + readback. The tool works, but brings no improvement (again not surprisingly).
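For readers who want to interpret raw values of these registers themselves, here is a minimal decoding sketch. The bit positions are assumptions based on my reading of the Intel SDM (IA32_CLOCK_MODULATION at 0x19A, IA32_THERM_STATUS at 0x19C) and should be verified against the manual for the exact CPU. The function names and the sample raw values are hypothetical, not readings from the faulty machine.

```python
# Bit positions are assumptions taken from the Intel SDM as I understand it;
# verify them for the exact CPU model before relying on the result.

def decode_clock_modulation(raw, extended=True):
    """Decode a raw IA32_CLOCK_MODULATION (0x19A) value."""
    enabled = bool(raw & (1 << 4))          # bit 4: on-demand modulation enable
    if extended:
        steps, denom = raw & 0xF, 16        # bits 3:0, duty cycle in 1/16 steps
    else:
        steps, denom = (raw >> 1) & 0x7, 8  # bits 3:1, duty cycle in 1/8 steps
    return {"enabled": enabled, "duty": f"{steps}/{denom}"}


def decode_therm_status(raw):
    """Decode a raw IA32_THERM_STATUS (0x19C) value."""
    def bit(n):
        return bool(raw & (1 << n))

    return {
        "thermal_status": bit(0),
        "thermal_log": bit(1),
        "prochot_now": bit(2),       # PROCHOT# or FORCEPR# currently asserted
        "prochot_log": bit(3),       # sticky: asserted at some point since reset
        "critical_temp": bit(4),
        "critical_log": bit(5),
        "power_limit": bit(10),
        "power_limit_log": bit(11),
        "degrees_below_tjmax": (raw >> 16) & 0x7F,  # digital readout, bits 22:16
        "reading_valid": bit(31),
    }


# Hypothetical raw values for illustration only.
clockmod = decode_clock_modulation(0x14)
therm = decode_therm_status(0x88300000)
print(clockmod)
print(therm["prochot_log"], therm["degrees_below_tjmax"])
```

The sticky "log" bits are the interesting ones for a fault that happened days ago, since the "now" bits only describe the instant of the read.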
I have also taken a look at some Windows performance counters - using a command-line proggie called perf32 from the SnmpTools by Erwan L. Here is the output:
"Processor Information\% Processor Performance\_Total"
9.49953814259068
"Processor Information\% Processor Performance\0,0"
9.99836227595223
"Processor Information\% Processor Performance\0,1"
9.00490682886555
"Processor Information\% Processor Performance\0,2"
9.95220023870093
"Processor Information\% Processor Performance\0,3"
8.99923060510645
"Processor Information\% Processor Utility\_Total"
9.15043821293382
"Processor Information\% of Maximum Frequency\_Total"
73
"Processor Information\Processor Frequency\_Total"
1700
"Processor Information\Processor Frequency\0,0"
1700
the same for CPU core 1,2 and 3
"Processor Information\Processor State Flags\0,0"
1
the same for CPU core 1,2 and 3
"Processor Information\Parking Status\_Total"
0
"Processor\Interrupts/sec\0"
1979.5905212033
"Processor\% Interrupt Time\_Total"
0.777807192025936
So... the CPU cores are stuck at their nominal "on the tin" frequency of 1.7 GHz, which is about 74% of 2.3 GHz (EIST max), consistent with the 73 reported for "% of Maximum Frequency". On-demand throttling is off. But something makes Windows believe that the overall "Processor Performance percentage" is just 9 or 10 per cent. Two cores at 9%, two cores at 10%, resulting in a 9.5% aggregate systemwide percentage. 0.095 * 2300 MHz = 218 MHz, i.e. the 0.22 GHz shown by Task Manager.
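The arithmetic can be replayed from the counter values listed above. This is only a sketch: it assumes EIST max (2300 MHz) is the reference for "% Processor Performance", and the last figure assumes the two-factor formula from earlier in the post holds, which is exactly what is in doubt here.

```python
# Per-core "% Processor Performance" values copied from the perf32 output.
per_core_pct = [
    9.99836227595223,
    9.00490682886555,
    9.95220023870093,
    8.99923060510645,
]
EIST_MAX_MHZ = 2300  # assumed reference clock for the percentage
NOMINAL_MHZ = 1700   # "Processor Frequency" counter

mean_pct = sum(per_core_pct) / len(per_core_pct)
effective_mhz = mean_pct / 100 * EIST_MAX_MHZ
freq_pct = NOMINAL_MHZ / EIST_MAX_MHZ * 100

# If performance % = frequency % * throttling %, this is the unexplained factor.
implied_throttle = mean_pct / freq_pct

print(round(mean_pct, 2))          # 9.49
print(round(effective_mhz))        # 218
print(round(freq_pct, 1))          # 73.9
print(round(implied_throttle, 3))  # 0.128
```

So the frequency alone accounts for roughly 74%, and some other factor of about 0.13 would be needed to arrive at the reported 9 to 10 per cent.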
Notice those percentages, their granularity: 9 and 10 per cent, rather precisely.
How does that fit in with the integer EIST multiplier (which is apparently locked at 17) and the 1/16th CLOCKMOD duty cycle, which also appears to be off in the first place?
What other factor does Windows take into account when calculating that CPU performance percentage? Or is it just read verbatim from the hardware? What MSR's should I check out in the hardware, to verify/understand the numbers that Windows is reporting?
Does perhaps Turbo (the successor to EIST) make the performance control more granular, "free from the EIST limits", and does it bring some relevant new MSR's along?
Those percentages might actually be wrong... the culprit system in the faulty state may actually be even slower than what the percentages would suggest. That is just by subjective comparison to my lab experiments with CLOCKMOD throttling, at a similar "effective clock rate" around 0.25 GHz.
Thanks for your time. Any ideas welcome.
EDIT: oh damn - yes there are a number of MSR's. I wish Turbostat was available for Windows.
EDIT: I've written a tool to perform a raw dump of a given range of MSR's, on screen and into a CSV file. Turns out that the MSR space is rather sparse... This tool will allow me to take a dump of some meaningful range on the culprit/patient and on a healthy box, for comparison. The comparison can be somewhat automated as well. I can then wade through the data with an Intel manual at hand, focus on differences etc.
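The comparison step could look roughly like the sketch below. The CSV layout is an assumption (a header row with `msr` and `value` columns holding hex strings), since the format of the dump is not described here, and the sample rows are invented purely to exercise the code.

```python
import csv
import io


def load_dump(fileobj):
    """Read a dump with assumed columns 'msr' and 'value' (hex strings)."""
    reader = csv.DictReader(fileobj)
    return {int(row["msr"], 16): int(row["value"], 16) for row in reader}


def diff_dumps(healthy, faulty):
    """Return (address, healthy value, faulty value) for every mismatch."""
    diffs = []
    for addr in sorted(healthy.keys() | faulty.keys()):
        h, f = healthy.get(addr), faulty.get(addr)
        if h != f:  # also catches an MSR present in only one dump
            diffs.append((addr, h, f))
    return diffs


# Invented sample data standing in for two real dump files.
healthy_csv = io.StringIO("msr,value\n0x19A,0x0\n0x1FC,0x1\n")
faulty_csv = io.StringIO("msr,value\n0x19A,0x0\n0x1FC,0x0\n0x1B1,0x88300000\n")

diffs = diff_dumps(load_dump(healthy_csv), load_dump(faulty_csv))
for addr, h, f in diffs:
    h_txt = "missing" if h is None else hex(h)
    f_txt = "missing" if f is None else hex(f)
    print(f"MSR {addr:#x}: healthy={h_txt} faulty={f_txt}")
```

With real files, `open(path, newline="")` would replace the `io.StringIO` objects. Free-running registers such as the time stamp counter will differ between any two dumps, so taking two dumps of the same box a few seconds apart and excluding whatever changes is one way to filter out that noise first.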
I've also tried using the Windows build of ACPICA's acpidump etc. I haven't tried it on the culprit machine yet, but on my old notebook (2015 model Lenovo with UEFI) it couldn't find the _PSS object... I may give it another try on the problematic box.
