Comment by Numerlor
6 hours ago
Undervolting would definitely help, and is the actual fix. The current Intel fixes were mostly just for the symptoms, as the main issue is high voltage+power when pushing high clocks, but they can't actually fix that as it'd downgrade the advertised clocks the CPUs were sold with.
Sorry, but that understanding is dangerously incomplete. You're describing the first set of issues they uncovered, but there's also:
"Microcode and BIOS code requesting elevated core voltages which can cause Vmin shift especially during periods of idle and/or light activity" (emphasis mine)
https://community.intel.com/t5/Blogs/Tech-Innovation/Client/...
Recall also that "Vmin shift" means "the minimum voltage the processor needs to run correctly goes up", so if the issue isn't addressed, that level of undervolt may stop working.
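To make that last point concrete, here is a toy sketch of why a fixed undervolt can stop working once Vmin shifts. Everything in it is an assumption for illustration: the function name and all the millivolt figures are made up, and real stability depends on load, temperature and clocks rather than a single threshold.

```python
def undervolt_still_stable(stock_mv: int, undervolt_offset_mv: int, vmin_mv: int) -> bool:
    """Toy model: the chip is stable only if the applied voltage is at or above Vmin."""
    applied_mv = stock_mv - undervolt_offset_mv
    return applied_mv >= vmin_mv


# Hypothetical numbers in millivolts, not measurements from any real CPU.
stock_mv = 1300
undervolt_offset_mv = 80   # applied voltage becomes 1220 mV
fresh_vmin_mv = 1180       # Vmin of the chip when new
shifted_vmin_mv = 1250     # Vmin after degradation has pushed it up

print(undervolt_still_stable(stock_mv, undervolt_offset_mv, fresh_vmin_mv))    # True
print(undervolt_still_stable(stock_mv, undervolt_offset_mv, shifted_vmin_mv))  # False
```

The same offset that was fine on the fresh chip falls below the shifted Vmin, which is the scenario described above.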
Not sure what's supposed to be wrong with that? The clock tree degrades at high voltage. Some theories I've seen have the CPU requesting significantly higher voltages as clocks alternate when there's a short lull in load, e.g. from a pipeline stall. Then there doesn't seem to be a good enough sensor net in the right places for the CPU to react to this, so it just "burns" itself down gradually. Assuming these are true, an actual fix from Intel would be either relaxing boost clocks to ones that are universally safe, which opens them up to a lawsuit from everyone who bought the high-end SKUs, or doing a new stepping, which is extremely expensive for a finished design.
When the CPU degrades, it naturally needs higher voltages to be stable, until the point where it just breaks completely and no amount of voltage will help it. But if your CPU doesn't degrade because it hasn't been overdoing it on voltage, then there's no reason for Vmin to shift.
As an anecdote from someone I know who runs these in prod for game servers: limiting the CPU to 80°C, 1.4V-1.45V, and 400A has kept them alive for years under 24/7 load. Maybe go a bit lower on the voltage if you want to be sure longer term, as they are fine with just mass RMAing these. There's also a large amount of variation in silicon quality between samples: one can run cool and completely fine even at the old stock settings, while another has to pull, say, 1.5x the power for the same load and clocks, which makes it degrade.