Comment by Numerlor
1 hour ago
Not sure what's supposed to be wrong with that? The clock tree degrades at high voltage. Some theories I've seen are that the CPU requests significantly higher voltages while clocks are alternating, when there's a short lull in load from e.g. a pipeline stall. There also doesn't seem to be a good enough sensor network in the right places for the CPU to react to this, so it just gradually "burns" itself down. Assuming these are true, the actual fixes from Intel would be either relaxing boost clocks to ones that are universally safe, which would open them up to a lawsuit from everyone who bought the high-end SKUs, or doing a new stepping, which is extremely expensive for a finished design.
When a CPU degrades, it naturally needs higher voltages to be stable, up to the point where it breaks completely and no amount of voltage will help it. But if your CPU doesn't degrade because it hasn't been overdoing it on voltages, then there'll be no Vmin shift to cause issues.
Anecdotally, someone I know who runs these in prod for game servers has kept them alive for years under 24/7 load by limiting the CPU to 80°C, 1.4V-1.45V, and 400A. Go a bit lower on the voltage if you want to be sure longer term, as they are fine with just mass RMAing these. There's also a large amount of variation in silicon quality between samples: one can run cool and completely fine even at the old stock settings, while another has to pull, say, 1.5x the power for the same load and clocks, which makes it degrade.
You're implying that if you don't run the CPU at high power and high heat it won't have problems, and that undervolting or underclocking will prevent damage. This is not correct: while that is helpful, Vmin degradation occurs during idle or light activity as well.
Vmin will creep up, and the headroom for undervolting will shrink. It will affect the high clocks first (they demand the highest voltage), which is why dropping the max boost multiplier a step or two can also work around it (at the cost of basically downgrading it to a cheaper processor).
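To make the headroom argument concrete, here is a rough toy model in Python. Every number in it is made up for illustration: the clock bins, the supplied voltages, and the fresh Vmin values are not measurements from any real CPU. It also assumes Vmin shifts by the same amount at every clock and that the top bin starts with the least margin. Under those assumptions the highest clock is the first to run out of headroom, and dropping the top bin restores stability.

```python
# Toy model: every number here is hypothetical, not measured data.
# Clock (GHz) -> voltage the CPU supplies at that clock (V).
SUPPLIED_V = {5.4: 1.30, 5.7: 1.38, 6.0: 1.45}
# Clock (GHz) -> minimum stable voltage (Vmin) of a fresh chip (V).
FRESH_VMIN = {5.4: 1.20, 5.7: 1.30, 6.0: 1.40}


def headroom(vmin_shift):
    """Supplied voltage minus shifted Vmin for each clock, in volts."""
    return {
        clock: round(SUPPLIED_V[clock] - (FRESH_VMIN[clock] + vmin_shift), 3)
        for clock in SUPPLIED_V
    }


def unstable_clocks(vmin_shift):
    """Clocks whose shifted Vmin is above what the CPU supplies."""
    return sorted(c for c, margin in headroom(vmin_shift).items() if margin < 0)


# Assumed uniform Vmin shift, in volts, as the chip degrades.
for shift in (0.0, 0.06, 0.09):
    print(f"shift {shift:.2f} V: headroom {headroom(shift)}, "
          f"unstable clocks {unstable_clocks(shift)}")
```

In this sketch a negative headroom means the chip is unstable at that clock, and any positive headroom is what is left over for undervolting.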
Idle and light load are bad for degradation only because that's the most common scenario where the boosting algorithm will actually go to the highest clocks. With more cores loaded, the CPU targets lower clocks on all cores so that it can actually get the power for it and stay coolable, but if you're idle and then some task loads just a single core for a bit, the CPU will boost that core as high as it can. The voltage spikes from those boosts will cause local hotspots even if the CPU is cool overall.
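A minimal sketch of that boosting behaviour, again in Python and again with invented numbers: the boost table, the all-core clock, and the idle clock below are hypothetical and do not describe any real CPU's algorithm. It only shows why a mostly idle machine with short one- or two-core bursts can spend more time at the top clock than a machine that is fully loaded.

```python
# Toy model only: hypothetical boost table, not any real CPU's behaviour.
# Each entry is (max number of loaded cores, target clock in GHz).
BOOST_TABLE = [(2, 6.0), (4, 5.8), (8, 5.6)]
ALL_CORE_CLOCK = 5.4  # assumed clock when more than 8 cores are loaded
IDLE_CLOCK = 0.8      # assumed clock when nothing is loaded
TOP_CLOCK = max(clock for _, clock in BOOST_TABLE)


def target_clock(active_cores):
    """Pick the target clock for a given number of loaded cores."""
    if active_cores == 0:
        return IDLE_CLOCK
    for max_cores, clock in BOOST_TABLE:
        if active_cores <= max_cores:
            return clock
    return ALL_CORE_CLOCK


def peak_share(trace):
    """Fraction of samples in a load trace spent at the top boost clock."""
    hits = sum(1 for cores in trace if target_clock(cores) == TOP_CLOCK)
    return hits / len(trace)


# Loaded-core counts per sample: a mostly idle desktop vs. a 24/7 all-core load.
light_trace = [0, 1, 0, 1, 1, 0, 2, 1]
heavy_trace = [24] * 8
print("light:", peak_share(light_trace), "heavy:", peak_share(heavy_trace))
```

The light trace hits the top clock, and so the highest voltage request, far more often than the heavy one, which is the point being made above.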