BuildZoid has been doing some further analysis of what’s going on inside Raptor Lake CPUs with regard to voltage spikes. I thought I’d post his observations for anyone who’s interested, and I’ve added some crude background info of my own. If you already know this stuff then please be kind; there may be others who do not.
There appear to be two types of voltage spikes happening with Raptor Lake CPUs. The first is the normal transient spike that happens when a heavy workload is interrupted, either by manually stopping a stressful application (e.g. Prime95, Cinebench R23, etc.) or, more importantly, by moving the mouse. In the latter case, Windows detects the mouse movement and immediately interrupts what the CPU is doing in anticipation of user input. If Windows didn’t do this then we users would get locked out of our own unresponsive PCs…and that wouldn’t be nice. However, once Windows detects that the user either just had a spasm, got bored, or is fooling with it, it returns priority to the currently running application(s). This type of transient voltage spike is normal and is basically unavoidable. Incidentally, AMD CPUs are actually worse than Intel CPUs in this respect because they tend to run much flatter load lines (i.e. more aggressive LLC, less Vdroop). More on that later.
A little background: The system the CPU uses to request voltage from the motherboard VRM (via VID requests), and the VRM power delivery system itself, are amazingly fast, but they are not instantaneous from an electrical point of view. This means that there is a brief window in which the CPU is receiving far more voltage than it needs for the workload it’s processing. That additional voltage “bleeds into the silicon” until it dissipates. A CPU has billions of transistors that are like tiny gates that open and close. The more gates that are switching, the more current flows through the CPU, i.e. in one side and out the other. A multi-core process like Prime95 or Cinebench R23 requires huge numbers of these gates to be active to allow the digital data to flow through. When this type of process is momentarily interrupted, the gates get slammed shut while the VRM is still delivering power for the heavy load, so the voltage at the CPU briefly overshoots, a bit like a speeding car slamming into the back of a stopped car. Parts of both cars end up flying off in multiple directions until the energy of the collision is dissipated. Well, the same thing happens to the excess voltage entering the CPU. And, over time, these constant transient spikes will degrade (i.e. damage) the processor. But if the CPU and underlying silicon were designed properly, the CPU should still function for years, sometimes decades. Now, it’s worth reiterating that in most cases these transient spikes are too quick for software like HWInfo to report. Keep that in mind. Much of the CPU activity will unfortunately remain a “black box” forever.
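To put a rough number on why monitoring software misses these spikes, here is a quick back-of-the-envelope sketch in Python. It assumes the software takes one instantaneous reading per polling interval and that the spike lands at a random moment. The 10 microsecond spike and 2 second polling interval are my own illustrative guesses, not measured values, and the function names are made up for this post.

```python
def catch_probability(spike_seconds, poll_interval_seconds):
    """Chance that a single instantaneous sample lands inside one spike.

    Assumes the spike starts at a random time relative to the polling clock.
    """
    return min(1.0, spike_seconds / poll_interval_seconds)


def average_spikes_per_catch(spike_seconds, poll_interval_seconds):
    """On average, how many spikes occur for each one the poller happens to see."""
    return 1.0 / catch_probability(spike_seconds, poll_interval_seconds)


# Illustrative assumptions only: a 10 microsecond spike, polled every 2 seconds.
SPIKE_SECONDS = 10e-6
POLL_INTERVAL_SECONDS = 2.0

p = catch_probability(SPIKE_SECONDS, POLL_INTERVAL_SECONDS)
n = average_spikes_per_catch(SPIKE_SECONDS, POLL_INTERVAL_SECONDS)
print(f"Chance of seeing any one spike: {p:.6%}")
print(f"Roughly one spike in {n:,.0f} would show up in the log")
```

Under those assumptions nearly every spike goes unseen, which is why an oscilloscope on the board is needed to see them at all.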
However, there appears to be a second source of transient spikes that last much longer than the regular transient spikes. These spikes appear to happen when an application is opened, etc., and it’s these secondary voltage spikes that Intel now appears to be concerned about with regard to the reports of prematurely degrading Raptor Lake CPUs. A little background: Intel’s microcode has algorithms that try to predict what the near-future workload is going to be, so that voltage requests reach the motherboard VRM early enough for it to process them and deliver the voltage back to the CPU in time. It’s a little bit like the speculative branch execution that goes on inside the CPU. Sometimes the CPU guesses wrong, but most of the time it guesses right, and the processor’s IPC then benefits. Well, it appears that the Raptor Lake microcode is overly aggressive (or buggy) when requesting higher voltages to prepare for incoming spikes in workload, such as anticipating user input when you move the mouse, open a new application, or click on a function within an application. The August microcode patch will probably be aimed at smoothing out these speculative VID requests, possibly among other objectives of the new code.
With regard to managing transient spikes (at least the normal kind), you can significantly reduce the potential damage of these events by (1) reducing the overall starting voltage that the CPU runs at via the various undervolting techniques described by CiTay and others, and (2) reducing the LLC aggressiveness and thereby allowing more Vdroop under full-load scenarios. For example, MSI’s LLC=6 will have more Vdroop than LLC=3. More Vdroop is good right up to the point where it causes instability, crashes and WHEA errors. That’s how you know you’ve gone too far. Keep in mind that as you lower the “starting voltage” via CPU Lite Load or negative voltage offsets (e.g. the Adaptive Offset mode), you may need to reduce Vdroop by increasing the aggressiveness of the Load Line Calibration. If you are attempting to adjust both ends of the voltage curve (i.e. light loads and full loads) via these two mechanisms then it becomes a balancing act.
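To make the balancing act a bit more concrete, here is a rough Python sketch that treats the load line as a simple resistance, so that Vdroop is just current times load-line resistance. Every number in it (idle voltage, full-load current, the two load-line values, the offset) is made up for illustration; these are not MSI’s actual LLC values, and real behaviour is more complicated than a straight line.

```python
def vdroop(load_amps, loadline_milliohms):
    """Voltage drop (in volts) caused by the load line at a given current."""
    return load_amps * loadline_milliohms / 1000.0


def vcore_under_load(idle_volts, load_amps, loadline_milliohms, offset_volts=0.0):
    """Approximate steady-state Vcore: idle voltage plus offset, minus Vdroop."""
    return idle_volts + offset_volts - vdroop(load_amps, loadline_milliohms)


# Hypothetical values for illustration only; not taken from any real board.
IDLE_VOLTS = 1.30
FULL_LOAD_AMPS = 200.0
LOADLINES_MILLIOHMS = {
    "flatter (less Vdroop)": 0.4,
    "steeper (more Vdroop)": 1.1,
}

for label, milliohms in LOADLINES_MILLIOHMS.items():
    for offset in (0.0, -0.05):
        v = vcore_under_load(IDLE_VOLTS, FULL_LOAD_AMPS, milliohms, offset)
        print(f"{label:24s} offset {offset:+.2f} V -> ~{v:.3f} V at full load")
```

The point of the sketch is that the undervolt offset and the Vdroop stack on top of each other at full load, so a steep load line plus a big negative offset can push the loaded voltage below what the chip needs to stay stable.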
So, to recap: regular transient spikes are normal (and therefore Intel should have considered these when designing the silicon), but buggy transients due to speculative requests are not. For many of us, the new microcode can’t come soon enough.
Side note: For those of you who have implemented some kind of undervolt, perhaps following CiTay’s very impressive undervolting guide, keep in the back of your mind that you might have to raise that voltage a little to compensate for Intel lowering its speculative requests. If you get any kind of instability after the new microcode is installed, try that first before you freak out too much.
Boy, I think I must be trying to compete with CiTay for the longest post award! Sorry about that... I appreciate your patience.
EDIT: Oh...I forgot to mention that if you really want to stop transient spikes, just don't ever touch that mouse again!
