Comment by CodeArtisan

5 hours ago

    # decode next bytecode opcode
    movzwl      8(%r12), %eax
    # advance bytecode pointer to next bytecode
    addq        $8, %r12
    # load the interpreter function for next bytecode
    movq        __deegen_interpreter_dispatch_table(,%rax,8), %rax
    # dispatch to next bytecode
    jmpq        *%rax

You may reduce that even further by pre-decoding the bytecode: you replace each bytecode with the address of its implementation and then do (with GCC extended goto)

  goto *program_bytecodes[counter]

I've been playing around with this, and it's worth noting that pre-decoding has a downside: every instruction (ignoring operands) becomes the width of a pointer (8 bytes on x86-64), so far fewer instructions fit in cache. My opcodes are a single byte, so pointer-sized instructions mean 8x fewer fit per cache line. I haven't had time to benchmark it to see what the real-world difference is, but it's worth keeping in mind.
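To make the pre-decoding idea concrete, here is a minimal sketch of a direct-threaded interpreter using GCC/Clang computed goto. The three-opcode instruction set and all names are invented for illustration; the point is the decode loop that rewrites opcodes into handler addresses up front, so dispatch is just `goto *`:

```c
#include <assert.h>
#include <stddef.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Run a program given as raw opcode bytes; returns the accumulator. */
static long run(const unsigned char *code, size_t len) {
    /* Label addresses, indexed by opcode (GCC "labels as values"). */
    static void *const table[] = { &&do_inc, &&do_dec, &&do_halt };

    /* Pre-decode: replace each opcode byte with the address of its
       handler, so the dispatch below never consults the table again.
       Fixed-size buffer: this sketch assumes len <= 256. */
    void *decoded[256];
    assert(len <= sizeof decoded / sizeof decoded[0]);
    for (size_t i = 0; i < len; i++)
        decoded[i] = table[code[i]];

    long acc = 0;
    void **ip = decoded;
    goto **ip++;            /* dispatch to the first instruction */

do_inc: acc++; goto **ip++;
do_dec: acc--; goto **ip++;
do_halt: return acc;
}
```

Note how each `goto **ip++` plays the role of the `movq`/`jmpq` pair in the assembly above, minus the table load.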

Somewhat off topic, looking at that assembly... mine compiles to (for one of the opcodes):

    movzx  eax,BYTE PTR [rdi]
    lea    r9,[rip+0x1d6fd]        # 2ae30 <instructions_table>
    mov    rax,QWORD PTR [r9+rax*8]
    inc    rdi
    jmp    rax

(also compiled from C++ with clang's musttail annotation)
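A handler of that shape can be sketched in plain C as well, using clang's `musttail` statement attribute (also supported by GCC 15+); the two-opcode instruction set and all names here are my own invention, not the code that produced the assembly above:

```c
#include <assert.h>

enum { OP_INC, OP_HALT };

typedef long (*handler_fn)(const unsigned char *ip, long acc);

static long op_inc(const unsigned char *ip, long acc);
static long op_halt(const unsigned char *ip, long acc);

/* Dispatch table indexed by opcode byte. */
static handler_fn const table[] = { op_inc, op_halt };

#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 15)
#  define MUSTTAIL __attribute__((musttail))
#else
#  define MUSTTAIL /* compiler lacks musttail; hope for tail-call optimization */
#endif

static long op_inc(const unsigned char *ip, long acc) {
    acc++;
    ip++;
    /* Guaranteed tail call: compiles to a load + jmp, as in the listing above. */
    MUSTTAIL return table[*ip](ip, acc);
}

static long op_halt(const unsigned char *ip, long acc) {
    (void)ip;
    return acc;
}

static long run(const unsigned char *code) {
    return table[*code](code, 0);
}
```

Because the call is a forced tail call, the interpreter runs in constant stack space no matter how long the program is.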

  • I have wondered whether it's worth storing instruction offsets (from the first instruction) rather than raw instruction pointers to increase cache efficiency, then they could be encoded in just 2 (or at worst 3) bytes. At the cost of an extra register.
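The offset idea in the bullet above can be sketched with computed goto: pre-decode the program into 2-byte offsets from a base label instead of 8-byte pointers, and add the base back at dispatch time. Everything here is illustrative, and it leans on the GCC/Clang extension of subtracting label addresses, plus the assumption that all handlers lie within ±32 KiB of the base:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_INC, OP_HALT };

static long run(const unsigned char *code, size_t len) {
    /* Offsets are measured from this base label. */
    void *const base = &&do_inc;
    /* Signed 16-bit offsets: 4x denser than 8-byte pointers, at the cost
       of one extra register holding `base` during dispatch. */
    const int16_t handler_off[] = {
        0,                                            /* OP_INC  */
        (int16_t)((char *)&&do_halt - (char *)base),  /* OP_HALT */
    };

    /* Pre-decode into the compact offset form; assumes len <= 64. */
    int16_t prog[64];
    assert(len <= sizeof prog / sizeof prog[0]);
    for (size_t i = 0; i < len; i++)
        prog[i] = handler_off[code[i]];

    long acc = 0;
    const int16_t *ip = prog;
    goto *(void *)((char *)base + *ip++);

do_inc:
    acc++;
    goto *(void *)((char *)base + *ip++);
do_halt:
    return acc;
}
```

The same trick works with tail-called handlers by storing offsets from the first handler function instead of from a label.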