Comment by userbinator

9 years ago

The problem description is short and scary:

Problem: Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.

I wonder how many users have experienced intermittent crashes etc. and just nonchalantly attributed it to something else like "buggy software" or even "cosmic ray", when it was actually a defect in the hardware. Or more importantly, how many engineers at Intel, working on these processors, saw this happen a few times and did the same.

More interestingly, I would love to read an actual detailed analysis of the problem. Was it a software-like bug in microcode e.g. neglecting some edge-case, or a hardware-level race condition related to marginal timing (that could be worked around by e.g. delaying one operation by a cycle or two)? It reminds me of the bugs described at http://danluu.com/cpu-bugs/, which suggests to me that CPU manufacturers should do far more regression testing. I would recommend demoscene productions, cracktros, and even certain malware, since they tend to exercise the hardware in ways that more "mainstream" software wouldn't come close to. ;-)

(To those wondering about ARM and other "simpler" SoCs in embedded systems etc.: They have just as many hardware bugs as PCs, if not more. We don't hear about them often, since they are usually worked around in the software, which is usually customised exactly for the application and doesn't change much.)

A few past lives ago, I used to work on the AIX kernel at IBM. I once spent a few weeks poring through trace data trying to investigate a very mysterious cache-aligned memory corruption induced by a memory stress test. Our trace data was quite comprehensive, and was always turned on due to its very low overhead. It was concerning enough (and took me long enough) that it eventually sucked in the rest of my team to aid in the investigation. None of these other guys were noobs: a couple of them had (at the time) built over 20 years of experience in this system, and in diagnosing similar memory corruption bugs beyond any doubt (many were due to errant DMAs from device drivers). I had too, though for much less than 20 years.

After several full days of team-wide debugging, we had no better explanation based on the available evidence than cosmic rays, or a hardware bug. IBM's POWER processor designers worked across the street from us, so we tried to get them to help, first by asking nicely, then by escalating through management channels.

Their reply was more or less: we've run our gamut of hardware tests for years, and your assertion that it's hardware related is vanishingly unlikely... we don't look into hardware bugs unless you can prove to us beyond a doubt it's hardware related. Cache-aligned memory corruption without any other circumstantial evidence isn't enough.

On a crashed test system sitting in the kernel debugger for several weeks now, there would be no more circumstantial evidence beyond the traces. A corruption like this was never seen again, by all accounts.

If we were right and it was evidence of a hardware failure, this is one way such a problem can go undetected. I hope it was something else, or even a cosmic ray, but we'll never know for sure, I guess.

  • I understand that someone at Microsoft Research once found a bug in the XB360's memory subsystem by model checking a TLA+ spec of it. The story goes that IBM initially refused to believe the bug report. A few weeks later they admitted that such a bug did indeed exist, had been missed by all their testing and would have resulted in system crashes after about 4 hours use.

  • Did you by chance see the paper a year back or so outlining that memory errors are more likely to occur near page boundaries? The author's premise was that a lot of 'cosmic rays' are just manufacturing flaws.

    • I debugged lots of memory corruption errors in my time working on AIX kernel stuff. Most of the ones I got to work on did indeed happen at a page boundary [1]. I think there is a pretty simple reason for this: "pages" are both the logical unit size of memory the kernel's allocator hands out, and the unit size that the hardware is capable of addressing in most cases. Therefore, when something at the kernel level is done incorrectly, it often references someone else's page.

      It's also possible for a _byte_ offset to be wrong, and these types of errors need not occur at a page boundary. Useful things to do with a raw memory dump with this kind of corruption (at least in AIX kernel):

      - Identify the page containing the corruption, and find all activity concerning the physical frame in the traces.

      - Try to reverse engineer the bad data. Often times, there are pointers you can follow. You would have to manually translate the virtual address to physical frames, but that's pretty simple to do, both for user space and kernel space (in our case, it was always in kernel space, which was 64-bit and just used a simple prefix).

      From there, you just have to be crafty and thorough in following the breadcrumbs and either identifying the bug in code, or at least who should investigate next.

      In my original post, note that the corruption was on a _CPU cache_ boundary (128 bytes in POWER). IIRC, the containing page was allocated and pinned [2] for a longer time than the trace buffer tracked (it's been a few years, though).

      [1] To make things fun, AIX and POWER support multiple page sizes: 4K, 64K, 16MB, 16GB. Hardware also has the ability to promote / demote between 4K and 64K for in-use memory... lots of fun :-)

      [2] AIX is a pageable kernel. If kernel code can't handle a page fault, it needs to "pin" the memory.
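A rough sketch of the first triage step described above: given the address and length of a corrupted span, work out which page it falls in and whether it lines up with a page or a cache line. The page sizes and the 128-byte line size come from the comment; the function name, the sample address, and the 4K default are hypothetical and only there for illustration.

```python
# Page sizes mentioned in footnote [1]; values in bytes.
PAGE_SIZES = {"4K": 4 << 10, "64K": 64 << 10, "16MB": 16 << 20, "16GB": 16 << 30}
CACHE_LINE = 128  # bytes, the POWER cache line size quoted in the comment


def classify_corruption(addr, length, page_size=PAGE_SIZES["4K"]):
    """Summarise where a corrupted span [addr, addr + length) sits.

    Hypothetical helper: addr is a non-negative integer address.
    """
    page_mask = page_size - 1
    first_page = addr & ~page_mask
    last_page = (addr + length - 1) & ~page_mask
    return {
        "page_base": first_page,
        "page_offset": addr & page_mask,
        "page_aligned": (addr & page_mask) == 0,
        # Both start and length are whole cache lines.
        "line_aligned": addr % CACHE_LINE == 0 and length % CACHE_LINE == 0,
        "crosses_page": first_page != last_page,
    }


# Made-up example: one cache line of bad data in the middle of a 4K page.
info = classify_corruption(0x7F3A1080, 128)
```

A span that is line-aligned but not page-aligned, as in the made-up example, is the pattern that pointed away from the usual "wrong page" software bugs in the story above.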

      ... note that any of this can be really outdated, it's been almost a decade since I was an expert in this stuff :-)

      (Edit: formatting... how do you make a pretty bulleted list on HN??)

    • Shouldn't ECC offer a second form of benchmark on this? If you see transient, cosmic-ray looking errors in ECC, presumably that's much stronger evidence of a hardware bug. Of course, I've also heard it claimed that ECC design and manufacture are held to a higher standard that might hide the issue.

    • having often attributed these things to cosmic rays as well, anyone have any actual estimates on how often/likely cosmic rays are to cause errors in desktop-like setups?

      had to have been a cosmic ray at least once right?

> short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH)

This is yet another of the many places where the complexity of the x86 ISA shows up and makes its hardware implementations more complicated: the x86 ISA has instructions which can modify the second-lowest byte of a register, while keeping the rest of the register unmodified (but AFAIK no instructions which do the same for the third-lowest byte, showing its lack of orthogonality).

For in-order implementations, like the ones which originated the x86 ISA, it's not much of a problem. But for out-of-order implementations, which do register renaming, partial register updates are harder to implement, since the output value depends on both the output of the instruction and the previous value of the register. The simplest implementation would be to make an instruction depending on the new value wait until it's committed to the physical register file or equivalent, and that's probably how it was done for these instructions for these partial registers before Skylake.

For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case. That corner case probably can only be reached when some part of the out-of-order pipeline is completely full, which is why it needs a short loop (so the decoder is not the bottleneck, AFAIK there's a small u-op cache after the decoder) and two threads in the same core (one thread is probably not enough to use up all the resources of a single core). The microcode fix is probably "just" flipping a few bits to disable that optimization.

And this shows how an ISA is more than just the decoding stage; design decisions can affect every part of the core. In this case, if your ISA does not have partial register updates (usually by always zero-extending or sign-extending when writing to only part of a register, instead of preserving the non-overwritten parts of the previous value), you won't have the extra complexity which led to this bug. AMD partially avoided this when doing the 64-bit extension (a partial write to the lower 32 bits of a register clears the upper 32 bits), but they kept the legacy behavior for writes to the lower 16 bits, or to either of the 8-bit halves of the lower 16 bits.
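
To make the partial-write rules above concrete, here is a minimal Python sketch that models a 64-bit register as an integer. The name `write_reg` and its parameters are made up for illustration. It only mimics the architectural merge rules (8-bit and 16-bit writes preserve the rest of the register, 32-bit writes zero the upper half) and says nothing about how any real core implements renaming.

```python
MASK64 = (1 << 64) - 1


def write_reg(old, value, width, high8=False):
    """Return the new 64-bit register value after a write of `width` bits.

    Hypothetical model of x86-64 architectural behavior, not of any
    microarchitecture. `high8=True` selects the AH/BH/CH/DH byte (bits 8..15).
    """
    if width == 64:
        return value & MASK64
    if width == 32:
        # AMD64 rule: a 32-bit write clears the upper 32 bits.
        return value & 0xFFFF_FFFF
    if width == 16:
        # Legacy rule: bits 16..63 of the old value are preserved.
        return (old & ~0xFFFF & MASK64) | (value & 0xFFFF)
    if width == 8 and high8:
        # AH-style write: only bits 8..15 change.
        return (old & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)
    if width == 8:
        # AL-style write: only bits 0..7 change.
        return (old & ~0xFF & MASK64) | (value & 0xFF)
    raise ValueError("unsupported width")


rax = 0x1122334455667788
after_ah = write_reg(rax, 0xAB, 8, high8=True)   # 0x112233445566AB88
after_al = write_reg(rax, 0xAB, 8)               # 0x11223344556677AB
after_ax = write_reg(rax, 0xABCD, 16)            # 0x112233445566ABCD
after_eax = write_reg(rax, 0xDEADBEEF, 32)       # 0x00000000DEADBEEF
```

The 8-bit and 16-bit cases need the old value as an input, which is the extra dependency an out-of-order core has to track; the 32-bit and 64-bit cases do not.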

  • The loop needs to be short because the loopback buffer is only active in loops of 64 or fewer entries (usually fewer real instructions, something like 40 or so). Moreover, Skylake introduced one loopback buffer per thread, instead of the previous loopback buffer shared between both threads.

    My guess is that is where the bug is; the behavior for partial register access stalls---insert one extraneous uop to combine, e.g., ah with rax---is unchanged since Sandy Bridge.

    • Just as information, the Loop Stream Detector was introduced in Intel Core microarchitectures. With Sandy Bridge, it was 28 μops. It grows to 56 μops (with HT off) with Haswell and Broadwell. It grows again (considerably) with Skylake to 64 μops per thread (HT on or off).

      The LSD resides in the BPU (branch prediction unit) and it is basically telling the BPU to stop predicting branches and just stream from the LSD. This saves energy. However, predicting is different than resolving. Branch resolution still happens and when resolution (speculation) fails, the LSD bails out.

      In any case, 64 μops is a lot. That's a good sized inner loop.
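
As a small illustration of the sizes quoted above, the sketch below checks whether a loop of a given μop count would fit in the Loop Stream Detector on each microarchitecture. The capacities are the figures from this comment, not independently verified; the function name is hypothetical, and real eligibility for streaming presumably depends on more than the μop count alone.

```python
# LSD capacities in μops, as quoted in the comment above.
LSD_CAPACITY_UOPS = {
    "Sandy Bridge": 28,
    "Haswell": 56,    # figure quoted for HT off
    "Broadwell": 56,  # figure quoted for HT off
    "Skylake": 64,    # per thread, HT on or off
}


def fits_in_lsd(loop_uops, uarch):
    """True if a loop of `loop_uops` μops is within the quoted LSD capacity."""
    return loop_uops <= LSD_CAPACITY_UOPS[uarch]


# A hypothetical 40-μop inner loop.
fits = {uarch: fits_in_lsd(40, uarch) for uarch in LSD_CAPACITY_UOPS}
```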

  • It's also a problem with SMT[1]. The design cost is pretty small; it's a fairly straightforward extension of what an out of order CPU is already doing. But due to the concurrency issues, debugging/verifying it is incredibly difficult.

    [1] Simultaneous MultiThreading, which is marketed by Intel under the name Hyperthreading when using two threads.

  • This is an amazing analysis, and seems entirely likely to be right to me. Thanks for writing it up.

  • You really don't know what you're talking about.

    ---

         For Skylake, they probably optimized partial register
         writes to these four partial "high" registers (AH, BH,
         CH, DH), but the optimization was buggy in some
         hard-to-hit corner case.
    

    They did not do this.

    The high registers (AH/BH/DH/CH) are nearly written out of existence with the REX Prefix in 64bit mode. Within the manual(s) it is called out effectively not to use them, as they're now emulated and not supported directly in hardware.

    The 16bit registers (AX/BX/DX/CX) are in a worse situation: it costs additional cycles to even decode these instructions, as the main encoder can't handle them, so you have to swap to the legacy encoder and you'll end up losing alignment. This costs ~4-6 cycles. Also, the perf registers to track this were only added in Haswell (and require Ring0 to use [2]).

    High Register and 16bit registers are a huge wart that it seems Intel is trying desperately hard to get us to stop using.

        That corner case probably can only be reached when some
        part of the out-of-order pipeline is completely full,
        which is why it needs a short loop (so the decoder is not
        the bottleneck, AFAIK there's a small u-op cache after the decoder)
    

    There is a 64uOP cache between the decoder and L1i cache that is called loop stream detector. Normally this exists to do batched writes to the L1i cache.

    But in _some_ scenarios, when a loop can fit completely within this cache, it'll be given extreme priority. This is a way to max out the 5uOP per cycle Intel gives you [1]. It'll flush its register file to L1 cache piecemeal as it continues to predict further and further and further ahead, speculatively executing EVERY PART OF IT in parallel. [3]

    In short this scenario is extremely rare. uOPs have stupidly weird alignment rules. Which you can boil down to:

        Intel x64 Processors are effectively 16byte VLIW RISC processors
        that can pretend to be 1-15byte AMD64 CISC processors at a minor performance
        cost. 
    

    ---

    The real issue here is properly reloading the register file and OoO state when Loop Stream mode ends.

    This is likely just a small microcode fix. The 8low/8high/16bit/32bit/64bit weirdness is likely because somebody wasn't doing alignment checks when flushing the register file.

    ---

    [1] On Skylake/KabyLake. IvyBridge, SandyBridge, Haswell, and Broadwell limited this to 4.

    [2] Volume 3 performance counting registers; I think we're up to 12 now on Broadwell.

    [3] Volume 3 Chapter 3.4.1.7 (Page 107)

    • > The high registers (AH/BH/DH/CH) are nearly written out of existence with the RAX flag in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.

      I think you meant REX prefix but even that doesn't make any sense.

      High registers are a first class element of the Intel 64 and IA-32 architectures. They aren't going anywhere. Microarchitectural implementations are an entirely different thing.

      That aside, where in the manuals does Intel say not to use the high registers? They're pretty clear about such warnings and usually state them in Assembly/Compiler Coding Rules.

      From the parent:

      > For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.

      That is about right. I don't agree with the preceding slap at x86 but this is a good summary.

      BTW writing to the low registers is in principle also a partial register hazard but then Intel sees fit to optimize that as a more common case.

      In particular, mov AH,BH is not emulated from the MS-ROM which is just hella slow. It uses two μops for Sandy Bridge and above. This is covered in 3.5.2.4 Partial Register Stalls.

      Lastly, there is no section 3.4.1.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual which is 3 volumes. You must be talking about the Intel® 64 and IA-32 Architectures Optimization Reference Manual which is a single volume. And it isn't clear how that section furthers your argument.

    • > High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.

      Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.

      Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...

      What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.

CPU manufacturers do do huge amounts of testing, and Intel does formal verification of some functional units. The reliability is far better than most software, in part because making a new release costs billions.

  • In my limited experience, their root cause analyses are really impressive as well, with lots of internal attention and resources. I'm not allowed to talk about any Intel issues, but we reported a very strange issue to Nvidia, sent a couple of dozen cards back, and six months later got a truly fascinating report back with hundreds of pages of compute test result tables, electron microscope images, and chemistry lab reports. Anything that hints of a manufacturing problem is taken incredibly seriously.

    • I worked for a large company that used thousands of Intel CPUs every year, and when we suspected a CPU bug we were mostly brushed off. We had a very persistent person on the team who kept tracking the issue to find correlations, and some very good kernel developers who went on to nearly pinpoint the issue. Only then did Intel pay attention, and it still took them several months to acknowledge the issue, give a brief report on it, and confirm that our proposed workaround would indeed work.

      I've never seen Intel do a very good job at failure analysis or following on with failures unless prodded very hard.

    • This is because once upon a time people were put in jail for not doing that (when the customer was the DoD).

  • That's absolutely true. When it comes to CPU/memory, skilled software engineers always think, "it must be my bug, it always is".

    So in that super rare case of actually running into a CPU defect, it's a mindfuck, it'll drive you crazy. You'll be looking for the flaw in your algorithm which makes it fail once a week under production load. But you just can't find it, it makes no sense ...

    (When it comes to drivers for network/storage/graphics etc devices, it's a whole different story. Those things are piles of bugs that need work-arounds in drivers.)

    • A little anecdote describing one such bug. I didn't find this, another one of my teammates did.

      The symptom was that a board with a specific microcontroller on it would be working fine, then after a power cycle it might not keep working. A flash dump would show that the reset vector, at the very start of flash on that system, was all zeroed out. Of course the system would not run anymore, but why did it happen? After months of intermittent debugging and attempts to reproduce it, the cause was determined. At least under certain conditions, the brownout detection level was lower than the voltage level that caused the CPU to make errors. If the board lost power slowly, the CPU would start executing corrupted / arbitrary instructions, which generally included lots of zeros. It would occasionally write zeros to the zero address, bricking the board.

      Since then we have external power monitoring and reset circuits on all the new boards, but existing ones needed a fix. Luckily the board had a power failure interrupt connected, so when that triggers we reconfigure the CPU to execute at the slowest possible clock rate, which greatly reduced the occurrence.

    • Similarly: Kernel bugs! Once upon a time I spent a good week trying to reproduce a rare crash one of our users saw every so often (just enough to get our attention, but not enough to get a repro case). Stack traces made no sense, and in general the whole crash made less sense the more I looked at it. Turns out it was a bug in XNU's pthread subsystem, a part where I would never have looked if it weren't for desperation, because, well, it's the kernel, it works, right?

    • You have some lovely stories online about these things. Like the guy tracking down a stuck bit in RAM.

      Or on the network level, a VPN that failed only when traversing one possible route between company offices.

  • It may be far better than average software quality but far more also relies on it. The question is whether the quality is adequate in light of what is at stake.

    • Really? How is it different from a kernel bug that causes random behaviour? Both apparently can be fixed with a software update. They're both bad, but why should Intel be held to a higher standard if mitigation is similarly complex?

  • Releasing CPU microcode to work around bugs is very cheap.

    • But that doesn't mean you necessarily get to keep the affected feature - the only 'work-around' might be to disable it. Consider TSX on Haswell and Broadwell, where Intel had to disable the whole feature because of a bug. And of course there was the Pentium FDIV bug, which couldn't be fixed by any kind of microcode update.

      If Intel had to completely disable hyperthreading in Skylake and Kaby Lake, that would make the premium anyone paid for an i7 over an i5 worthless.

  • This seems like precisely the sort of thing that a competent manufacturer should rule out formally. Formal verification of individual FUs isn't exactly ambitious...

    I think we're getting to levels of complexity where the process Intel uses, with lots of different QA and testing teams doing their best to look for bugs, just isn't going to cut it. We need formally verified models transformed step-by-verified-step all the way down to the silicon. It's already feasible, with free tools, to formally verify your high-level model (using e.g. LiquidHaskell) and then transform this to RTL (using e.g. Clash). With Intel's QA/testing budget, it's well within reach to A) verify the transformation steps and B) figure out how to close the performance gap between machine-generated (but maybe slower) and hand-rolled (faster, but evidently wrong) silicon.

    • I've done formal verification on several units of a microprocessor, and I can assure you, formal verification on individual FUs is very, very ambitious (think impossible) for a modern microprocessor.

      For example, you cannot possibly formally verify the fetcher unit on its own, because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

      Typically, you run formal verification on sub-blocks of sub-blocks of FUs.

      For this particular bug, it looks like multiple functional units are involved, so it might have been missed by formal verification.
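
A back-of-the-envelope sketch of the state-space argument above. The bit widths, cycle counts, and checking rate are made-up numbers chosen only to show how quickly the count grows; real formal tools reason symbolically rather than enumerating, so this is an intuition for why the raw space is out of reach, not a model of any tool.

```python
def input_sequences(input_bits, cycles):
    """Number of distinct input sequences over `cycles` clock cycles."""
    return 2 ** (input_bits * cycles)


def years_to_enumerate(count, checks_per_second=10**12):
    """Years needed to visit `count` cases at a hypothetical checking rate."""
    seconds_per_year = 365 * 24 * 3600
    return count / (checks_per_second * seconds_per_year)


small = input_sequences(8, 2)    # toy block: 8 input bits, 2 cycles -> 65536
large = input_sequences(64, 4)   # hypothetical unit: 64 input bits, 4 cycles -> 2**256
years = years_to_enumerate(large)
```

Splitting a unit into sub-blocks of sub-blocks, as described above, is one way to shrink the exponent to something a tool can handle.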

    • So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

    • Formal verification of a decode unit simply will never happen. The complexity is far too high.

      Formally verifying something like a multiplier block is difficult but doable if you care. Formally verifying an FPU is probably at about the limit.

      If you want formal verification, you would have to simplify a modern microprocessor a lot.

> I wonder how many users have experienced intermittent crashes

I wonder if it's exploitable ;) Maybe that's why they never release the details of these CPU bugs.

> Was it a software-like bug in microcode e.g. neglecting some edge-case, or a hardware-level race condition related to marginal timing

Not sure about microcode, these x86 cores execute many simple operations natively, by means of dedicated circuits. Microcode is only involved in emulation of complex x86 instructions.

And a hardware problem doesn't have to be marginal timing. It could simply be a logic bug, i.e. the circuit operates as designed, but it was designed to do something other than what it should be doing in some unforeseen circumstances.

I feel like a lot of the processors Intel has released recently have had problems like this. Intel's Bay Trail processors like the Celeron J1900 have a huge problem around power state management (https://bugzilla.kernel.org/show_bug.cgi?id=109051) that's unlikely to ever get resolved and makes those processors almost unusable under a lot of conditions (random hard hangs on systems without watchdog timers really kind of suck). I wonder if Intel has been more lax recently with how the systems get tested?

I've no knowledge of IC design, but it sounds to me like even the biggest name in the CPU industry doesn't do formal verification (or did they ever?). Is the process like when I'm writing some mediocre code and say to myself: "hmm, it probably works", and throw the bunch into version control (whereas they throw it to the wafer fab)?

> I would recommend demoscene productions, cracktros, and even certain malware

Modern PC demoscene productions don't really do very funky things CPU-wise anymore. Most run mostly in shaders, actually. Amiga and C64 are a different story, but Intel isn't making that many Amiga CPUs :-)