Comment by wbl

9 years ago

CPU manufacturers do do huge amounts of testing, and Intel does formal verification of some functional units. The reliability is far better than most software, in part because making a new release costs billions.

In my limited experience, their root cause analyses are really impressive as well with lots of internal attention and resources. I'm not allowed to talk about any Intel issues, but we reported a very strange issue to Nvidia, sent a couple of dozen cards back and six months later got a truly fascinating report back we with hundreds of pages of compute test result tables and electron microscope images and chemistry lab reports. Anything that hints of a manufacturing problem is taken incredibly seriously.

  • I worked for a large company that used thousands of Intel CPUs every year and when we suspected a CPU bug we were mostly brushed off. We had a very persistent person on the team who kept tracking the issue to find correlations and some very good kernel developers that went on to nearly pin-point the issue and only then did Intel pay attention and it then took them still several months to acknowledge the issue give a brief report on the issue and acknowledge that our proposed workaround will indeed work.

    I've never seen Intel do a very good job at failure analysis or following on with failures unless prodded very hard.

    • Intel likely has very thorough data on the issue, but you'll never see it unless you are one of their tier 1 customers* and have an NDA with them. In my experience working for a large hardware manufacturer they are very skittish about releasing detailed failure analysis data to outside companies.

      * For Intel that would be companies like Dell, Apple, HP, and maybe a couple of others.

    • To be fair, I don't even want to think about the amount of spurious bug reports they are probably receiving.

  • This is because once upon a time people were put in jail for not doing that (when the customer was the DoD).

That's absolutely true. When it comes to CPU/memory, skilled software engineers always think, "it must be my bug, it always is".

So in that super rare case of actually running into a CPU defect, it's a mindfuck, it'll drive you crazy. You'll be looking for the flaw in your algorithm which makes it fail once a week under production load. But you just can't find it, it makes no sense ...

(When it comes to drivers for network/storage/graphics etc devices, it's a whole different story. Those things are piles of bugs that need work-arounds in drivers.)

  • A little anecdote describing one such bug. I didn't find this, another one of my teammates did.

    The symptom was that a board with a specific microcontroller on it would be working fine, then after a power cycle it might not keep working. A flash dump would show that the reset vector, the first byte of flash on that system, would be all zeroed out. Of course the system would not run anymore, but why did it happen? After months of intermittent debugging and trying to reproduce the cause was determined. At least under certain conditions the brownout detection level was lower than the voltage level that caused the CPU to make errors. If the board lost power slowly then the CPU would start executing corrupted / arbitrary instructions which generally included lots of zeros. It would occasionally write zeros to the zero address, bricking the board.

    Since then we have external power monitoring and reset circuits on all the new boards, but existing ones needed a fix. Luckily the board had power failure interrupt connected, so when that triggers we reconfigure the CPU to execute on the slowest possible clock rate, which greatly reduced the occurrence.

    • Yes, we have seen similar in AVRs (328p) where the lowest voltage brownout threshold/trip is not enough to prevent the EEPROM and/or Flash from getting corrupted somehow...

  • Similarly: Kernel bugs! Once upon a time I spend a good week trying to reproduce a rare crash one of our users saw ever so often (just enough to get our attention, but not enough to get a repro case). Stack traces made no sense and just in general the whole crash made no sense the more I looked at it. Turns out it was a bug in XNUs pthread subsystem, a part where I would have never looked if it weren‘t for desperation, because, well, it‘s the kernel, it works, right?

  • You have some lovely stories online about these things. Like the guy tracking down a stuck bit in RAM.

    Or on the network level, a VPN that failed only when traversing one possible route between company offices.

    • I have a very old one. I was working on a embedded system using an Intel 80186. The ‘186 had a memory mapped IO system that was very unusual for Intel x86 CPUs. The code that I started with was originally written by a bunch of Motorola 68000 guys so they used this mode. I had to modify the interrupt structure so I rewrote the interrupt service routines. Since it was memory mapped and ANDs were documented to be atomic, I assumed that I could just AND bits into the interrupt mask register. Big mistake. It turned out that ANDs to the IO registers were not atomic. And could be interrupted in the middle of the write after read. I took me about a month to realize that this was a hardware “bug” and not in my interrupt code. Drove me nuts.

    • These problems have existed for years - I once tracked down a dead cell in core memory on a system I worked on in the Air Force. Because of the chance alignment, it caused incoming messages to never "end". Which was fun for the operators.

      http://www.1882nd.com/images/DSTE.jpg

It may be far better than average software quality but far more also relies on it. The question is whether the quality is adequate in light of what is at stake.

  • Really? How is it different from a kernel bug that causes random behaviour? Both apparently can be fixed with a software update. They're both bad, but why should Intel be held to a higher standard if mitigation is similarly complex?

releasing cpu micro code to work around bugs is very cheap.

  • But that doesn't mean you necessarily get to keep the affected feature - the only 'work-around' might be to disable it. Consider TSX on Haswell and Broadwell, where Intel had to disable the whole feature because of a bug. And of course there was the 486 FP bug which couldn't be fixed by any kind of microcode update.

    If Intel had to completely disable hyperthreading in Skylake and Kabylake that would make the premium anyone paid for i5 vs i7 worthless.

i am more willing to accept software having bugs / failing. i have 0 tolerance for hardware to have bugs

  • I wonder if there's a CPU out there that doesn't have bugs.

    • It depends on how exactly you define a bug, but something like a 6502 or Z80, which has essentially all the errata well-documented since it has existed for so long, might qualify.

      1 reply →

    • stop playing lawyer ball. i'm talking about the bugs that require turning off core functionality (such as hyperthreading) and/or that lead to system halts / corrupted data.

      yes it happens, but software bugs happen every day where as if your system blue screened every day you know dang well you'd be on the phone with the hardware vendor for a refund / new system.

      2 replies →

  • How much more would you pay, or how much of a speed penalty would you take, for bug-free CPUs?

This seems like precisely the sort of thing that a competent manufacturer should rule out formally. Formal verification of individual FUs isn't exactly ambitious...

I think we're getting to levels of complexity where the process Intel uses, with lots of different QA and testing teams doing their best to look for bugs, just isn't going to cut it. We need formally verified models transformed step-by-verified-step all the way down to the silicon. It's already feasible, with free tools, to formally verify your high-level model (using e.g. LiquidHaskell) and then transform this to RTL (using e.g. Clash). With Intel's QA/testing budget, it's well within reach to A) verify the transformation steps and B) figure out how to close the performance gap between machine-generated (but maybe slower) and hand-rolled (faster, but evidently wrong) silicon.

  • I've done formal verification on several units of a microprocessor, and I can assure you, formal verification on individual FUs is very, very ambitious (think impossible) for a modern microprocessor.

    For example, you can not possibly formally verify the fetcher unit on its own, because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

    Typically, you run formal verification on sub-blocks of sub-blocks of FUs.

    For this particular bug, it looks like multiple functional units are involved, so it might have been missed by formal verification.

    • > because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

      Then use a proof methodology that doesn't require exhaustive enumeration. This objection is actually fairly alarming to me; perhaps there is a larger disconnect between industrial formal techniques for hardware and software than I thought. Strings also occupy a "large state space", but this obviously doesn't prevent us from doing formal verification on functions over strings.

      3 replies →

    • I've had great success (in software) with randomized/probability based testing. You don't walk the full state space but generate subsets randomly to walk.

      If done naively, it doesn't work because you are unlikely to hit the problematic bug. So you have to be clever: do not generate inputs uniformly! Generate those inputs which are really really nasty all the time. If you find a problem, then gradually simplify the case until you have reduced the complexity to something which can be understood.

      With some experience for where former bugs tend to live, I've been able to remove lots of bugs from my software in edge cases via this method. (See "QuickCheck").

  • So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

    • In general, people expect hardware to be far more bug-free than all the software which depends on it. I'm disgusted enough at the current trend of continuously releasing half-baked software and "updating" it to fix some of the bugs (while introducing new bugs in the process); and yet it seems to be spreading to hardware too. From the perspective of formal verification, buggy hardware is like an incorrect axiom.

      but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

      They could've caught and fixed this one with some more testing, before actually releasing. Formal verification isn't necessarily going to help, if it's a statistical type of defect --- thus the concept of wafer yield, and why dies manufactured from the exact same masks can behave completely differently with respect to the clock speeds attainable and voltages required.

      4 replies →

    • Well, if a software bug leads to data corruption it is not really "allowed" as well. It is quite amazing that there are not more bugs like this. Intel spends an insane amount of money on very large teams of very smart people to work on these insanely complicated systems.

      I don't think this is sustainable, and I think that massive arrays of relatively simple processors. First, we will need a culture shift that learns programmers to program concurrent programs from the start, and this will take a lot of time (because technology moves a lot faster than culture).

    • It would be nice if they allowed users to update rather than vendors. I have a mobile Xeon. I don't know if it's affected or not.

    • It's funny, since many of today's software bugs are due to the legacy of C, faults of which can often times be attributed to mimicking the behaviour CPU hardware too closely.

      EDIT: My wording was poor, but what I meant was that it's not like hardware vendors do not make bugs happen due to deficiencies in their design, which are not only tolerated, but often reflected in software, which runs counter to OP saying that apparently hardware vendors are not allowed to have bugs, but the software ones are:

      > So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors

      3 replies →

  • Formal verification of a decode unit simply will never happen. The complexity is far too high.

    Formally verifying something like a multiplier block is difficult but doable if you care. Formally verifying an FPU is probably at about the limit.

    If you want formal verification, you would have to simplify a modern microprocessor a lot.

    • That's the whole point of doing a provably correct transformation from a simpler model domain. Even complicated OOOE engines are fairly tractable if you don't try to implement them at an RTL level...

      Perhaps we are imagining different formal verification methodologies. Can you tell me what kind of formal verification you're referring to?