Finding a CPU Design Bug in the Xbox 360 (2018)

5 years ago (randomascii.wordpress.com)

Bruce Dawson taught a 100-level class at the college I attended; it's been a while, but I think the class was about image processing. The first or second assignment was to copy one image region into another. Little did we know that we had implemented bit blit.

After we turned in the assignment, in the next class he grabbed one random submission, added some quick benchmark instrumentation, and asked the class how fast we thought it could run. The next 30 minutes were a total geek-out on cache lines, prefetching, alignment, and all the other dark arts of performance optimization. I don't remember exactly how much faster he got it than the naive for loop, but it was many orders of magnitude. There are a few classes that really stick with you, and that was definitely one of them.

  • The only things that stand out to me from college are the 100-level classes at the beginning and the 400-level classes in the middle that weren't in my major.

    Maybe it's impostor syndrome coupled with actually being unqualified and incompetent that makes those classes stand out.

    But we can just pretend it's only impostor syndrome.

It sounds like a mispredicted xcdbt would actually need to invalidate or otherwise make coherent the cache lines that it mistakenly made incoherent, which also affects the instructions that read the incorrect data, so effectively a full pipeline flush. Even if they got that right, I suspect it would still result in some interesting performance anomalies whenever a mispredicted xcdbt was speculatively executed and then "cancelled".

It's notable that in 2005, it was already near the end for the P4/NetBurst with its insanely long 31-stage pipeline, and CPU designs were moving toward increasing IPC rather than clock frequency.

  • The question of how to properly implement an instruction like xdcbt is interesting. Undoing the damage would be both tricky and expensive. Only doing the L2-skipping when the instruction is executed (as opposed to speculatively executed) would probably be way too late. It seems that such an instruction is probably not practical to implement correctly.

    I never asked any of the IBM CPU designers this question (it was too late to make changes so it wasn't relevant) and now I regret that.

    • Re-reading the post, it sounds like the conclusion was just “don’t use it” / label it as dangerous. Why didn’t they end up marking the instruction as “can’t speculate this”?

      (I can imagine wanting to keep it for straight line unrolled copies that don’t have prediction, but it still seems dicey given that you’d have to write any code with knowledge of the speculative fetches).

      2 replies →

I've been working on verifying the memory coherency units of modern out-of-order CPUs for a few years now. Nowadays, this would be a huge miss if it were to escape to silicon. You'd have a dead-on-arrival product.

  • I think the specific application of the CPU here makes it more palatable. It was a nice idea, but it didn't work out, so scan for it and don't publish games that contain that opcode; if possible, issue a microcode update that makes it either a no-op or an illegal instruction.

    • Game console CPUs get away with all sorts of brokenness. The Wii U CPU also has broken coherency, and needs workaround flush instructions in the core mutex primitives. You can't run standard PowerPC Linux on it multicore for this reason.

    • I agree that console silicon can be rushed and bugs slip through. But then again, we may not even be aware of all the bugs in Intel or AMD products where they have been fixed post silicon via mechanisms such as microcode patches.

  • What does ‘escape to silicon’ mean?

    • If Intel ships a CPU with a bug in it then that is an expensive mistake. If they produce a bunch of CPUs with a bug (escape to silicon) that can also be an expensive mistake.

      That said: 1) I deal with CPU bugs pretty regularly on Chrome. Some old CPUs behave unreliably with certain sequences of instructions and we get bursts of crashes on these CPUs with some Chrome versions. 2) Intel regularly "fixes" CPU bugs with microcode updates. 3) The Spectre family of vulnerabilities are arguably CPU bugs. 4) The Pentium fdiv bug was definitely a CPU bug.

      So, CPU bugs escape to silicon all the time. Way more than I would have guessed just a few years ago. Our industry is built on a pile of wobbly sand.

If the PREFETCH_EX flag was never passed, why did the branch predictor speculatively execute xdcbt? That branch would never be taken, so it ought not to be predicted.

Was it a static branch prediction hint? I know PowerPC has those. If so, could it have been fixed by just flipping the hint bit?

  • From the article:

    Instead simple branch predictors typically squish together a bunch of address bits, maybe some branch history bits as well, and index into an array of two-bit entries. Thus, the branch predict result is affected by other, unrelated branches, leading to sometimes spurious predictions.
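The aliasing the article describes is easy to demonstrate with a toy model: hash some address bits into a small table of two-bit saturating counters, and two unrelated branches can land on the same entry. This is a minimal sketch, not the Xenon's actual predictor; the table size, hash, and addresses are illustrative choices.

```python
TABLE_BITS = 4  # deliberately tiny so aliasing is easy to trigger
table = [1] * (1 << TABLE_BITS)  # 2-bit counters, start "weakly not taken"

def _index(pc, history):
    # "Squish together" address bits and branch-history bits.
    return (pc ^ history) & ((1 << TABLE_BITS) - 1)

def predict(pc, history):
    return table[_index(pc, history)] >= 2  # True = predict taken

def update(pc, history, taken):
    idx = _index(pc, history)
    if taken:
        table[idx] = min(3, table[idx] + 1)
    else:
        table[idx] = max(0, table[idx] - 1)

# A hot, always-taken branch trains its counter to "strongly taken"...
for _ in range(4):
    update(pc=0x1040, history=0, taken=True)

# ...and an unrelated branch at an aliasing address inherits the
# "taken" prediction even though it has never been taken itself.
print(predict(pc=0x2040, history=0))  # True
```

This is exactly why guarding xdcbt behind a never-passed flag wasn't enough: the guarding branch can still predict "taken" because some other branch trained its shared counter.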

> And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:
>
> 1. The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
>
> 2. The crashes went away.

I love the simplicity and the genius behind this idea.
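The mechanics of the trick are simple since PowerPC instructions are fixed-width 32-bit words: scan the code image and overwrite every occurrence of the offending word with a trap. A sketch, assuming a big-endian image; the xdcbt word below is a placeholder (I don't have the real encoding to hand), while 0x7FE00008 is the standard PowerPC unconditional trap (tw 31,0,0) that debuggers use for breakpoints:

```python
import struct

XDCBT_WORD = 0x7C00206C  # placeholder; NOT the real xdcbt encoding
TRAP_WORD = 0x7FE00008   # PowerPC unconditional trap, tw 31,0,0

def patch_breakpoints(image: bytes, target: int, trap: int) -> bytes:
    """Replace every 4-byte-aligned occurrence of `target` in a
    big-endian code image with a trap, returning the patched image."""
    n = len(image) // 4
    words = struct.unpack(">%dI" % n, image)
    patched = [trap if w == target else w for w in words]
    return struct.pack(">%dI" % n, *patched)

# Tiny fake image: a nop-like word, the "xdcbt", and a blr.
image = struct.pack(">III", 0x60000000, XDCBT_WORD, 0x4E800020)
patched = patch_breakpoints(image, XDCBT_WORD, TRAP_WORD)
print(patched.hex())  # the middle word is now the trap
```

If the traps never fire but the crashes stop, the architecturally executed path provably never reached xdcbt, which leaves speculation as the culprit.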

Would it be possible to tell the CPU not to apply branch prediction in this one case?

  • There's a "branch hint" in PowerPC, I wonder if that's acted on by the Xenon CPU in question.

    edit: it's discussed in the comments as well but they don't know either. The author responds: "I can’t remember how PowerPC branch hints work but if the branch hint overrode the branch predictor then it could have avoided the bug."

  • No, because the branch predictor is just "on" or "off". What you'd really be asking for is a way to keep it from speculatively executing specific instructions. I'm going to infer from the article that there wasn't a way to do that, and that it was far too late to spin a new CPU revision to prohibit speculatively executed xdcbt.

    • I assume one could have put a serializing instruction inside the "dangerous" path, which should then lead to the speculative execution path being rolled back before reaching the dangerous instruction. Obviously it'd also be expensive...

      3 replies →

  • You probably could have done the equivalent if it were IA-64, but I suspect you can't do this with plain old branch-prediction mechanisms.