Comment by pizlonator

1 year ago

The way interpreters usually achieve this kind of speedup in C++ is using computed goto. Then there’s no calling convention junk on the path from one opcode to the next.

Also, the main reason why interpreters get faster using either the computed goto style or the tail style versus a classic switch loop is that it reduces pressure on the branch predictor since there’s statically one indirect branch per opcode rather than statically just one indirect branch.

(As the article claims) even with computed goto, register assignment of the most frequently used variables is fragile because the CFG of the function is so complicated.

Register assignment is much less fragile when each function is small and takes the most important variables by argument.
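
For concreteness, here's a minimal sketch of that tail-call shape (purely illustrative - it assumes clang's [[clang::musttail]], and the Inst layout, names, and fixed instruction size are made up for the example, not the article's actual code):

    #include <cstdint>

    struct Inst;
    using Handler = void (*)(const Inst* pc, uint64_t* regs);

    struct Inst {
        Handler handler;      // pre-decoded code pointer for this instruction
        uint8_t dest, a, b;
    };

    // Each opcode is its own small function. The hot state (pc, regs) is
    // passed by argument, so the compiler can keep it in registers, and the
    // musttail call compiles to an indirect jump straight to the next handler.
    void op_add(const Inst* pc, uint64_t* regs) {
        regs[pc->dest] = regs[pc->a] + regs[pc->b];
        [[clang::musttail]] return (pc + 1)->handler(pc + 1, regs);
    }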

  • It's also fragile, in a different way, if you're threading state through tail calls.

    In my experience writing computed goto interpreters, this isn't a problem unless you have more state than what can be stored in registers. But then you'd also have that same problem with tail calls.

    • Fallback paths most definitely have more state than what can be stored in registers. Fallback paths will do things like allocate memory, initialize new objects, perform complicated fallback logic, etc. These fallback paths will inevitably spill the core interpreter state.

      The goal is for fast paths to avoid spilling core interpreter state. But the compiler empirically has a difficult time doing this when the CFG is too connected. If you give the compiler an op at a time, each in its own function, it generally does much better.

  • Exactly. Computed goto helps with branch prediction, but does not help w.r.t register allocation & other compiler optimizations.

    • As I mentioned in another part of the thread - the way you get that under control in a computed goto interpreter (or a switch loop interpreter) is careful use of noinline.

      Also, it probably depends a lot on what you’re interpreting. I’ve written, and been tasked with maintaining, computed goto interpreters for quite a few systems and the top problem was always the branches and never the register pressure. My guess is it’s because all of those systems had good noinline discipline, but it could also just be how things fell out for other reasons.
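
      To make the idea concrete, the discipline looks roughly like this (a sketch with made-up names like Heap and allocateSlow, not code from any of those systems):

          #include <cstddef>

          struct Heap { char* cursor; char* limit; };

          // Rare path: may grow the heap, collect garbage, run complicated
          // logic (definition elided here). Keeping it noinline means its
          // register pressure and spills stay out of the hot interpreter
          // code that calls it.
          __attribute__((noinline))
          void* allocateSlow(Heap* heap, std::size_t size);

          // Hot path: a pointer bump, small enough to stay register-friendly.
          inline void* allocate(Heap* heap, std::size_t size) {
              char* result = heap->cursor;
              if (__builtin_expect(heap->limit - result < (std::ptrdiff_t)size, 0))
                  return allocateSlow(heap, size);   // rare fallback
              heap->cursor = result + size;
              return result;
          }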

> there’s statically one indirect branch per opcode rather than statically just one indirect branch.

Could you elaborate on this with a couple more paragraphs? What do you mean by one indirect branch per opcode rather than just one indirect branch? How is this achieved?

  • Instead of a for loop where you hit a switch at the top of the loop, you jump to the code block for the next instruction from the end of the previous instruction. That stops you from jumping to the top of the loop and then jumping a second time.

    On older CPUs, you’re less likely to hit a pipeline stall. This technique was called “super-pipelining” for that reason. But a few years ago, when this was discussed, someone pointed out that it’s usually not necessary anymore - that branch predictors can see through double jumps now.

    But as I alluded to in another reply, CPU caches are finite, and I suspect that in a fully realized interpreter, particularly one living side by side with a JIT, that microbenchmark is lying to you about how fast the inner loop is under production conditions.

  • Sure.

    Say you write an interpreter like this:

        for (;;) {
            switch (curInst->opcode) {
            case Inst::Foo:
                setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
                curInst += FooSize;
                break;
            case Inst::Bar:
                setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
                curInst += BarSize;
                break;
            // ... more cases like this
            }
        }
    

    Then the generated code will have an indirect branch for the switch. This is costly for two reasons:

    - Gotta load curInst->opcode, then look up the opcode in a compiler-generated jump table to get the code address of the case statement.

    - Gotta indirect jump.

    It turns out that the top cost (or even all of the cost) is the indirect jump. Maybe the table lookup is also some cost, at least on some CPUs. But remember, all jumps on modern CPUs are fast if predicted - regardless of the work seemingly required to do the jump. And they are slow if they are not predicted - and that slowness is about much more than the work required to do the jump (the cost is the CPU needing to roll back its speculations).

    Reason: the indirect jump prediction the CPU is doing is keyed by the address of the indirect jump instruction. There is one indirect jump in the code above. And for any real program you'll execute, that indirect jump will end up hitting all (or at least most) of the opcodes. Therefore, the CPU has no easy path to predicting the indirect jump. Maybe it'll sometimes get lucky with more sophisticated predictors (like ones that track history and use that to salt the key used to look up the predicted destinations, or ones that do more fancy stuff, like maybe some neural network to predict).

    How do you make the indirect jump faster? Have one indirect jump per instruction! Both the computed goto approach and the tail call approach get you there.

    Consider the computed goto version of the interpreter.

        Inst_foo_label:
            setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += FooSize;
            goto *curInst->handler;
        Inst_bar_label:
            setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += BarSize;
            goto *curInst->handler;
        // ... more handlers like this
    

    In this formulation of the interpreter, there is one indirect jump per instruction, rather than just one for the interpreter. Therefore, we're asking the CPU's predictor to do something simpler: to predict, for each instruction, what the next instruction will be. And then on top of that, the predictor still gets to use its neural network, or history buffer, or whatever.

    In terms of mathematical probability theory, it's like we're giving the CPU a first-order Markov chain, and that's sure to improve prediction accuracy. Empirically, it improves it a lot and it's a big speed-up.

    Here's yet another way to think of it. If I asked you to predict, "what's the next instruction", without any knowledge of what the prior one was, then you'd have a hard time - you'd only be able to reason about which instructions are most likely generally. But this computed goto interpreter is instead asking: "if I tell you the instruction I just executed, what's the next instruction", which gives you more leverage. Maybe adds are often followed by moves, for example.

    The tail call style also achieves this, because each instruction's handler will have an indirect tail call (literally an indirect jump) to the next instruction, which again, gives the CPU that Markov chain goodness. So let's call both the computed goto style and the tail call style the "Markov style". If you could rewrite the switch statement so that there was one switch statement per instruction (and you could convince the compiler not to combine them into one, lmao) then that would also be a Markov-style interpreter.
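
    Just to illustrate that last thought experiment (and again, in practice the compiler will usually merge these duplicated switches back into a single dispatch), the shape would be roughly:

        Inst_foo_label:
            setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += FooSize;
            switch (curInst->opcode) {     // this dispatch site is "what follows a Foo?"
            case Inst::Foo: goto Inst_foo_label;
            case Inst::Bar: goto Inst_bar_label;
            // ... one case per opcode
            }
        Inst_bar_label:
            setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += BarSize;
            switch (curInst->opcode) {     // and this one is "what follows a Bar?"
            case Inst::Foo: goto Inst_foo_label;
            case Inst::Bar: goto Inst_bar_label;
            // ... one case per opcode
            }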

    As for the cost of indirectly loading from the compiler's switch table, or other issues like pushing and popping registers: in my experimentation, these costs are like drops in the ocean compared to the cost of indirect jumps. Even with the Markov style interpreter, the CPU spends most of its time mispredicting and rolling back. So the details of how the work happens for individual instructions are usually less important than what you do about the prediction of that indirect jump.

    • Awesome, thank you for expanding. Now I see the intuition that branch prediction accuracy could be much higher using the knowledge of the last opcode, so this becomes a game of massaging the code to prod the CPU to use more inputs into its branch prediction. Also helpful color on your empirical observation that branch prediction accuracy dominates other factors like switch indirection and loading registers.

      There's one thing I'm still missing. How exactly do we force the CPU to use the last opcode in its branch prediction model? In your first switch example, the CPU "knows" the path it has followed to get to each iteration, so in theory it could use the information of the last node it visited (or the last two nodes, etc.) to aid branch prediction right?

      Related to that: in your second example, what exactly happens in `goto *curInst->handler;`? Doesn't this need to revert back to something like a switch statement which has the same problem? (Unless you are doing polymorphism / dynamic dispatch in that example? Which I assume has some performance penalty that you're saying is dwarfed by the extra branch prediction effectiveness). Analogous to the line in the OP's article that says `MUSTTAIL return dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch function need another switch statement? Probably missing a couple obvious things here.

      Lastly - if you have any books/article recommendations that helped you learn some of these intricacies (esp. regarding intuition about which performance quirks matter vs. don't matter) that would be great as well. Thanks!

I’ve heard this went out of fashion a while back - that branch predictors got good enough that it’s not necessary anymore.

But I wonder if that stays true as the size of the interpreter increases.

  • My most recent data says it's still relevant.

    It might not matter for very small interpreters, but it does matter for anything substantial.

    Definitely worth remeasuring though.

    • I would advocate that anything substantial is better off with a JIT, even a dumb one, e.g. a template JIT. We aren't dealing with 8-bit home computer memory constraints any longer, except in some IoT deployments.
