Comment by userbinator

2 years ago

Reminds me of Sudden Northwood Death Syndrome, 2002.

Looks like history may be repeating itself, or at least rhyming somewhat.

Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.

Now, with CPU manufacturers attempting to squeeze out all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) disregard for reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.

Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to add to the evidence that high voltage is what kills CPUs, not heat or frequency.

Edit: I just looked up the VCore maximum for the 13th/14th-gen processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.

> Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.

Oh the memories. I had a Thunderbird-core Athlon with a stock frequency of (IIRC) 1050MHz. It was stable at 1600MHz, and I ran it that way for years. I was able to get it to 1700MHz, but then my CPU's stability depended on ambient temperatures. When the room got hot in the summer my workstation would randomly kernel panic.

Interesting, I hadn’t heard about the Pentium overclocking issues. My theory on the current issue is that running chips for long periods of time at 100C is not good for chip longevity, but voltages could also be an issue. I came up with this theory last summer when I built my rig with a 13900K, though I was doing it with the intention of trying to set things up so the CPU could last 10 years.

Anecdotally, my CPU has been a champ and I haven’t noticed any stability issues despite doing both a lot of gaming and a lot of compiling on it. I lost a bit of performance, but not much, by setting a power limit of 150W.

I believe the first round of Intel excuses here blamed the motherboard manufacturers for trying to "auto" overclock these CPUs.