Finding a CPU Design Bug in the Xbox 360 (2018)

5 years ago (randomascii.wordpress.com)

Bruce Dawson taught a 100-level class at the college I attended; it's been a while, but I think the class was about image processing. The first or second assignment was to copy one image region into another. Little did we know that we had implemented bit blit.

After we turned in the assignment, in the next class he grabbed one random submission, added some quick benchmark instrumentation, and asked the class how fast we thought it could run. The next 30 minutes were a total geek-out on cache lines, prefetching, alignment, and all the other dark arts of performance optimization. I don't remember exactly how much faster he got it than the naive for loop, but it was many orders of magnitude. There are a few classes that really stick with you, and that was definitely one of them.

  • The only things that stand out to me from college are the 100-level classes at the beginning and the 400-level classes in the middle that weren't in my major.

    Maybe it's impostor syndrome coupled with actually being unqualified and incompetent that makes those classes stand out.

    But we can just pretend it's only impostor syndrome.

It sounds like a mispredicted xcdbt would actually need to invalidate or otherwise make coherent the cache lines that it mistakenly made incoherent, which also affects the instructions that read the incorrect data, so effectively a full pipeline flush. Even if they got that right, I suspect it would still result in some interesting performance anomalies whenever a mispredicted xcdbt was speculatively executed and then "cancelled".

It's notable that in 2005, it was already near the end for the P4/NetBurst with its insanely long 31-stage pipeline, and CPU designs were moving toward increasing IPC rather than clock frequency.

  • The question of how to properly implement an instruction like xdcbt is interesting. Undoing the damage would be both tricky and expensive. Only doing the L2-skipping when the instruction is executed (as opposed to speculatively executed) would probably be way too late. It seems that such an instruction is probably not practical to implement correctly.

    I never asked any of the IBM CPU designers this question (it was too late to make changes so it wasn't relevant) and now I regret that.

    • Re-reading the post, it sounds like the conclusion was just “don’t use it” / label it as dangerous. Why didn’t they end up marking the instruction as “can’t speculate this”?

      (I can imagine wanting to keep it for straight line unrolled copies that don’t have prediction, but it still seems dicey given that you’d have to write any code with knowledge of the speculative fetches).

      2 replies →

I've been working on verifying the memory coherency units of modern out-of-order CPUs for a few years now. Nowadays, this would be a huge miss if it were to escape to silicon. You'd have a dead-on-arrival product.

  • I think the specific application of the CPU here makes it more palatable. It was a nice idea, but it didn't work out, so scan for it and don't publish games that contain that opcode; if possible, issue a microcode update that makes it either a no-op or an illegal instruction.

    • Game console CPUs get away with all sorts of brokenness. The Wii U CPU also has broken coherency, and needs workaround flush instructions in the core mutex primitives. You can't run standard PowerPC Linux on it multicore for this reason.

    • I agree that console silicon can be rushed and bugs slip through. But then again, we may not even be aware of all the bugs in Intel or AMD products where they have been fixed post silicon via mechanisms such as microcode patches.

  • What does ‘escape to silicon’ mean?

    • If Intel ships a CPU with a bug in it then that is an expensive mistake. If they produce a bunch of CPUs with a bug (escape to silicon) that can also be an expensive mistake.

      That said: 1) I deal with CPU bugs pretty regularly on Chrome. Some old CPUs behave unreliably with certain sequences of instructions and we get bursts of crashes on these CPUs with some Chrome versions. 2) Intel regularly "fixes" CPU bugs with microcode updates. 3) The Spectre family of vulnerabilities are arguably CPU bugs. 4) The Pentium fdiv bug was definitely a CPU bug.

      So, CPU bugs escape to silicon all the time. Way more than I would have guessed just a few years ago. Our industry is built on a pile of wobbly sand.

If the PREFETCH_EX flag was never passed, why did the branch predictor speculatively execute xdcbt? That branch would never be taken, so it ought not to be predicted.

Was it a static branch prediction hint? I know PowerPC has those. If so, could it have been fixed by just flipping the hint bit?

  • From the article:

    Instead simple branch predictors typically squish together a bunch of address bits, maybe some branch history bits as well, and index into an array of two-bit entries. Thus, the branch predict result is affected by other, unrelated branches, leading to sometimes spurious predictions.
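The aliasing the article describes is easy to demonstrate with a toy model: hash some address bits into a small table of two-bit saturating counters, and two unrelated branches can land on the same entry. This is a minimal sketch, not the Xenon's actual predictor; the table size, hash, and addresses are illustrative choices.

```python
TABLE_BITS = 4  # deliberately tiny so aliasing is easy to trigger
table = [1] * (1 << TABLE_BITS)  # 2-bit counters, start "weakly not taken"

def _index(pc, history):
    # "Squish together" address bits and branch-history bits.
    return (pc ^ history) & ((1 << TABLE_BITS) - 1)

def predict(pc, history):
    return table[_index(pc, history)] >= 2  # True = predict taken

def update(pc, history, taken):
    idx = _index(pc, history)
    if taken:
        table[idx] = min(3, table[idx] + 1)
    else:
        table[idx] = max(0, table[idx] - 1)

# A hot, always-taken branch trains its counter to "strongly taken"...
for _ in range(4):
    update(pc=0x1040, history=0, taken=True)

# ...and an unrelated branch at an aliasing address inherits the
# "taken" prediction even though it has never been taken itself.
print(predict(pc=0x2040, history=0))  # True
```

This is exactly why guarding xdcbt behind a never-passed flag wasn't enough: the guarding branch can still predict "taken" because some other branch trained its shared counter.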

> And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:
>
> 1. The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
>
> 2. The crashes went away.

I love the simplicity and the genius behind this idea.
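The mechanics of the trick are simple since PowerPC instructions are fixed-width 32-bit words: scan the code image and overwrite every occurrence of the offending word with a trap. A sketch, assuming a big-endian image; the xdcbt word below is a placeholder (I don't have the real encoding to hand), while 0x7FE00008 is the standard PowerPC unconditional trap (tw 31,0,0) that debuggers use for breakpoints:

```python
import struct

XDCBT_WORD = 0x7C00206C  # placeholder; NOT the real xdcbt encoding
TRAP_WORD = 0x7FE00008   # PowerPC unconditional trap, tw 31,0,0

def patch_breakpoints(image: bytes, target: int, trap: int) -> bytes:
    """Replace every 4-byte-aligned occurrence of `target` in a
    big-endian code image with a trap, returning the patched image."""
    n = len(image) // 4
    words = struct.unpack(">%dI" % n, image)
    patched = [trap if w == target else w for w in words]
    return struct.pack(">%dI" % n, *patched)

# Tiny fake image: a nop-like word, the "xdcbt", and a blr.
image = struct.pack(">III", 0x60000000, XDCBT_WORD, 0x4E800020)
patched = patch_breakpoints(image, XDCBT_WORD, TRAP_WORD)
print(patched.hex())  # the middle word is now the trap
```

If the traps never fire but the crashes stop, the architecturally executed path provably never reached xdcbt, which leaves speculation as the culprit.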

Would it be possible to tell the CPU not to apply branch prediction in this one case?

  • There's a "branch hint" in PowerPC, I wonder if that's acted on by the Xenon CPU in question.

    edit: it's discussed in the comments as well but they don't know either. The author responds: "I can’t remember how PowerPC branch hints work but if the branch hint overrode the branch predictor then it could have avoided the bug."

  • No, because the branch predictor is just "on" or "off". What you'd really be asking for is a way to keep it from speculatively executing specific instructions. I'm going to infer from the article that there wasn't a way to do that, and that it was far too late to spin a new CPU revision to prohibit speculatively executed xdcbt.

    • I assume one could have put a serializing instruction inside the "dangerous" path, which should then lead to the speculative execution path being rolled back before reaching the dangerous instruction. Obviously it'd also be expensive...

      3 replies →

  • You probably could have done the equivalent if it were IA-64, but I suspect you can't do this with plain old branch-prediction mechanisms.