Comment by highfrequency

1 year ago

> there’s statically one indirect branch per opcode rather than statically just one indirect branch.

Could you elaborate on this with a couple more paragraphs? What do you mean by one indirect branch per opcode rather than just one indirect branch? How is this achieved?

Instead of a for loop where you hit a switch at the top of the loop, you jump to the code block for the next instruction from the end of the previous instruction. That saves you from jumping to the top of the loop and then jumping a second time.

On older CPUs, you're less likely to hit a pipeline stall this way. This technique was called "super-pipelining" for that reason. But a few years ago when this was discussed, someone pointed out that it's usually not necessary anymore: modern branch predictors can see through double jumps.

But as I alluded to in another reply, CPU caches are finite, and I suspect that in a fully realized interpreter, particularly one living side by side with a JIT, that microbenchmark is lying to you about how fast the inner loop is under production conditions.

Sure.

Say you write an interpreter like this:

    for (;;) {
        switch (curInst->opcode) {
        case Inst::Foo:
            setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += FooSize;
            break;
        case Inst::Bar:
            setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
            curInst += BarSize;
            break;
        ... more stuff like this
        }
    }

Then the generated code will have an indirect branch for the switch. This is costly for two reasons:

- Gotta load curInst->opcode, then look up the opcode in a compiler-generated jump table to get the code address of the case statement.

- Gotta indirect jump.

It turns out that the top cost (or even all of the cost) is the indirect jump. The table lookup may also add some cost, at least on some CPUs. But remember: all jumps on modern CPUs are fast if predicted - regardless of the work seemingly required to do the jump. And they are slow if they are not predicted - and that slowness is about much more than the work required to do the jump (the cost is the CPU needing to roll back its speculations).

Reason: the indirect jump prediction the CPU is doing is keyed by the address of the indirect jump instruction. There is one indirect jump in the code above. And for any real program you'll execute, that indirect jump will end up hitting all (or at least most) of the opcodes. Therefore, the CPU has no easy path to predicting the indirect jump. Maybe it'll sometimes get lucky with more sophisticated predictors (like ones that track history and use that to salt the key used to look up predicted destinations, or ones that do fancier stuff, like maybe a neural network predictor).

How do you make the indirect jump faster? Have one indirect jump per instruction! Both the computed goto approach and the tail call approach get you there.

Consider the computed goto version of the interpreter.

    Inst_foo_label:
        setOperand(curInst->dest, foo(getOperand(curInst->a), getOperand(curInst->b)));
        curInst += FooSize;
        goto *curInst->handler;
    Inst_bar_label:
        setOperand(curInst->dest, bar(getOperand(curInst->a), getOperand(curInst->b)));
        curInst += BarSize;
        goto *curInst->handler;
        ... more stuff like this

In this formulation of the interpreter, there is one indirect jump per instruction, rather than just one for the whole interpreter. Therefore, we're asking the CPU's predictor to do something simpler: to predict, for each instruction, what the next instruction will be. And on top of that, the predictor still gets to use its neural network, or history buffer, or whatever. In terms of mathematical probability theory, it's like we're giving the CPU a first-order Markov chain, and that's sure to improve prediction accuracy. Empirically, it improves it a lot and it's a big speed-up.

Here's yet another way to think of it. If I asked you to predict "what's the next instruction?" without any knowledge of what the prior one was, you'd have a hard time - you'd only be able to reason about which instructions are most likely in general. But this computed goto interpreter is instead asking: "if I tell you the instruction I just executed, what's the next instruction?", which gives you more leverage. Maybe adds are often followed by moves, for example.
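To put toy numbers on that leverage (mine, made up purely for illustration): suppose `mov` is 30% of all instructions overall, but follows `add` 90% of the time. The two dispatch styles are then asking the predictor two very different questions:

    Pr(next = mov)              = 0.30   // all a single shared indirect jump can exploit
    Pr(next = mov | prev = add) = 0.90   // what add's own indirect jump gets to exploit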

The tail call style also achieves this, because each instruction's handler will have an indirect tail call (literally an indirect jump) to the next instruction, which again, gives the CPU that Markov chain goodness. So let's call both the computed goto style and the tail call style the "Markov style". If you could rewrite the switch statement so that there was one switch statement per instruction (and you could convince the compiler not to combine them into one, lmao) then that would also be a Markov-style interpreter.
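To make the tail call style concrete, here's a minimal sketch of my own (not the OP's code). It assumes clang's `[[clang::musttail]]` attribute (clang 13+), and the `Inst`/`Handler`/`addHandler` names are made up to echo the snippets above:

    #include <cstdint>
    #include <cstdio>

    struct Inst;
    using Handler = void (*)(const Inst*, int64_t*);

    struct Inst {
        Handler handler;  // direct pointer to this opcode's handler function
        int dest, a, b;
    };

    void addHandler(const Inst* curInst, int64_t* regs) {
        regs[curInst->dest] = regs[curInst->a] + regs[curInst->b];
        const Inst* next = curInst + 1;
        // The guaranteed tail call compiles to an indirect jump, so each
        // handler ends with its own jump: one prediction site per opcode.
        [[clang::musttail]] return next->handler(next, regs);
    }

    void haltHandler(const Inst*, int64_t* regs) {
        std::printf("r0 = %lld\n", (long long)regs[0]);
    }

    int main() {
        int64_t regs[4] = {0, 2, 3, 0};
        Inst program[] = {
            {addHandler, 0, 1, 2},    // r0 = r1 + r2
            {haltHandler, 0, 0, 0},   // print and stop
        };
        program[0].handler(&program[0], regs);  // kick off dispatch
    }

Each handler's closing call is its own branch-prediction site, which is exactly the per-opcode indirect jump described above.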

As for the cost of indirectly loading from the compiler's switch table, or other issues like pushing and popping registers: in my experimentation, these costs are like drops in the ocean compared to the cost of indirect jumps. Even with the Markov-style interpreter, the CPU spends most of its time mispredicting and rolling back. So the details of how the work happens for individual instructions usually matter less than what you do about the prediction of that indirect jump.

  • Awesome, thank you for expanding. Now I see the intuition that branch prediction accuracy could be much higher using the knowledge of the last opcode, so this becomes a game of massaging the code to prod the CPU to use more inputs into its branch prediction. Also helpful color on your empirical observation that branch prediction accuracy dominates other factors like switch indirection and loading registers.

    There's one thing I'm still missing. How exactly do we force the CPU to use the last opcode in its branch prediction model? In your first switch example, the CPU "knows" the path it has followed to get to each iteration, so in theory it could use the information of the last node it visited (or the last two nodes, etc.) to aid branch prediction, right?

    Related to that: in your second example, what exactly happens in `goto *curInst->handler;`? Doesn't this need to revert to something like a switch statement, which has the same problem? (Unless you are doing polymorphism / dynamic dispatch in that example? Which I assume has some performance penalty that you're saying is dwarfed by the extra branch prediction effectiveness.) Analogous to the line in the OP's article that says `MUSTTAIL return dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch function need another switch statement? Probably missing a couple obvious things here.

    Lastly - if you have any book/article recommendations that helped you learn some of these intricacies (esp. regarding intuition about which performance quirks matter vs. don't matter), that would be great as well. Thanks!

    • > How exactly do we force the CPU to use the last opcode in its branch prediction model?

      In the tail call case: because each opcode is implemented by a function that has an indirect tail call at the end.

      In the computed goto case: each `goto *curInst->handler` is its own indirect branch.

      > Related to that: in your second example, what exactly happens in `goto *curInst->handler;`?

      `curInst->handler` is a pointer that points to some label. The goto is an indirect branch to exactly that pointer value, i.e. that label.

      It's a super crazy and dangerous feature! It obviates the need for a switch; it is the switch.
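
      For concreteness, here's a minimal self-contained sketch of my own (not from the thread) of the GNU labels-as-values extension that powers this: `&&label` yields a label's address as a `void*`, and `goto *ptr` jumps to it. Compiles with GCC or Clang (it's an extension, not standard C++):

          #include <cstdio>

          struct Inst {
              void* handler;  // address of the label that handles this opcode
              int operand;
          };

          int main() {
              // Each instruction carries a direct pointer to its handler label.
              Inst program[] = {
                  { &&add_label, 5 },
                  { &&add_label, 7 },
                  { &&halt_label, 0 },
              };
              Inst* curInst = program;
              int acc = 0;
              goto *curInst->handler;  // kick off dispatch

          add_label:
              acc += curInst->operand;
              ++curInst;
              goto *curInst->handler;  // this handler's own indirect jump

          halt_label:
              std::printf("acc = %d\n", acc);  // prints acc = 12
          }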