
Comment by userbinator

5 years ago

It sounds like a mispredicted xdcbt would actually need to invalidate or otherwise make coherent the cache lines that it mistakenly made incoherent, which also affects any instructions that read the incorrect data, so effectively a full pipeline flush. Even if they got that right, I suspect it would still result in some interesting performance anomalies whenever a mispredicted xdcbt was speculatively executed and then "cancelled".

It's notable that in 2005, it was already near the end for the P4/Netburst with its insanely long 31-stage pipeline, and CPU designs were moving towards increasing IPC rather than clock frequency.

The question of how to properly implement an instruction like xdcbt is interesting. Undoing the damage would be both tricky and expensive. Only doing the L2-skipping when the instruction is executed (as opposed to speculatively executed) would probably be way too late. It seems that such an instruction is probably not practical to implement correctly.
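To make the hazard concrete, here is a hypothetical C sketch of the kind of copy loop where this bites. There is no portable equivalent of xdcbt, so GCC's `__builtin_prefetch` stands in for both dcbt (high-locality hint) and xdcbt (low-locality hint); the function name, the `skip_l2` flag, and the 64-byte line size are all assumptions for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: on the Xbox 360 CPU the dangerous variant would be
 * xdcbt rather than a locality hint; __builtin_prefetch stands in here
 * because the L2-skipping prefetch has no portable equivalent. */
static void copy_with_prefetch(char *dst, const char *src, size_t n,
                               int skip_l2 /* would select xdcbt */)
{
    const size_t line = 64; /* assumed cache-line size */
    for (size_t i = 0; i < n; i += line) {
        if (skip_l2) {
            /* If the branch on skip_l2 is mispredicted, the CPU can
             * speculatively issue this "wrong" prefetch flavor -- on a
             * real xdcbt that speculative execution is the damage. */
            __builtin_prefetch(src + i + 4 * line, 0, 0); /* low locality */
        } else {
            __builtin_prefetch(src + i + 4 * line, 0, 3); /* keep cached */
        }
        size_t chunk = (n - i < line) ? n - i : line;
        memcpy(dst + i, src + i, chunk);
    }
}
```

With `__builtin_prefetch` a misprediction is merely a wasted hint; the point of the thread is that with xdcbt it leaves stale data behind, which is why "undoing the damage" comes up at all.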

I never asked any of the IBM CPU designers this question (it was too late to make changes so it wasn't relevant) and now I regret that.

  • Re-reading the post, it sounds like the conclusion was just “don’t use it” / label it as dangerous. Why didn’t they end up marking the instruction as “can’t speculate this”?

(I can imagine wanting to keep it for straight-line unrolled copies that don’t have prediction, but it still seems dicey given that you’d have to write any code with knowledge of the speculative fetches).

    • Making the instruction not speculatable would indeed be a hardware change, which there was not time for. So that was not an option.

And, let's say they did that. All other loads/prefetches are done in the early stages of the pipeline, when execution is speculative. I think they would need new logic at a later stage of the pipeline just for this instruction, in order to initiate a "late prefetch". That is potentially a lot of extra transistors and wires. And at that point you have a prefetch instruction that doesn't start prefetching until potentially dozens of cycles (or more) later, so using xdcbt instead of dcbt may just make your code run slower.

What about, then, an xdcbt which is seen in a context where it is known early on that it will definitely be executed, i.e. a context where it is not speculative? Well, there really is no such context. Practically speaking there are so many branches that when an instruction is decoded there is almost always a conditional branch in front of it in the pipeline. And, architecturally speaking, any earlier instruction could trigger an exception which would stop execution flow from reaching the xdcbt. Pipelines are really, really deep.

TL;DR - On heavily pipelined CPUs (even in-order ones) you don't know for sure that an instruction is "real" until it is time to commit its results, and that is way too late for a "prefetch".
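The "straight-line unrolled copy" idea floated earlier can be sketched like this. Again `__builtin_prefetch` is a hedged stand-in (xdcbt itself has no portable equivalent), and the function name and 64-byte line size are assumptions; the comments note why even this branch-free block doesn't escape the TL;DR above:

```c
#include <string.h>

/* Hypothetical: an unrolled, branch-free copy of exactly 256 bytes.
 * Within the block there is no conditional branch to mispredict between
 * the prefetches and the copies. But, as noted above, the call itself
 * sits behind earlier branches and potentially-faulting instructions,
 * so on a deep pipeline the prefetches are still speculative. */
static void copy256_unrolled(char *dst, const char *src)
{
    __builtin_prefetch(src,       0, 0);
    __builtin_prefetch(src + 64,  0, 0);
    __builtin_prefetch(src + 128, 0, 0);
    __builtin_prefetch(src + 192, 0, 0);
    memcpy(dst,       src,       64);
    memcpy(dst + 64,  src + 64,  64);
    memcpy(dst + 128, src + 128, 64);
    memcpy(dst + 192, src + 192, 64);
}
```

This is why "just don't put a branch in front of it" doesn't rescue the instruction: the speculation that matters starts long before the unrolled block is reached.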