> We implement LuaJIT Remake (LJR)[...] using Deegen. Across 44 benchmarks, LJR's interpreter is on average 179% faster than the official PUC Lua interpreter, and 31% faster than LuaJIT's interpreter.
Well, LuaJIT in JIT mode is on average about a factor of 3 faster than LuaJIT in interpreter mode (depending on the benchmark, up to ten times). And LuaJIT in JIT mode is e.g. about a factor of 8 faster on average than PUC Lua 5.1 (see e.g. http://software.rochus-keller.ch/are-we-fast-yet_Lua_results... for more information). So if Deegen is a factor of 2 faster than PUC Lua, or a factor of 1.3 faster than the LuaJIT interpreter, this is not very impressive. But since the LuaJIT interpreter is written in assembler, we might conclude that the speed-up of a manual assembler implementation compared to a generated interpreter is about 30%. Therefore it's no longer worth the effort to implement an interpreter in assembler (even less so if we consider cross-platform migration costs). But on the other hand, the Deegen-generated VM is significantly slower than e.g. the Mono VM or CoreCLR in JIT mode (see e.g. https://github.com/rochus-keller/Oberon/blob/master/testcase...).
> we might conclude that the speed-up of a manual assembler implementation compared to a generated interpreter is about 30%[, t]herefore it's no longer worth the effort to implement an interpreter in assembler
You got that backwards. The paper reports that Deegen's generated interpreter is faster than LuaJIT's handwritten one by 30%. That's actually pretty impressive, and achieved in an impressively straightforward way[1]. TL;DR: instruction dispatch via tail calls avoids the pessimized register allocation that you get in one huge monolithic interpreter loop. The whole dispatch sequence is just:
    # decode next bytecode opcode
    movzwl 8(%r12), %eax
    # advance bytecode pointer to next bytecode
    addq $8, %r12
    # load the interpreter function for next bytecode
    movq __deegen_interpreter_dispatch_table(,%rax,8), %rax
    # dispatch to next bytecode
    jmpq *%rax
[1] https://sillycross.github.io/2022/11/22/2022-11-22/
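For concreteness, here's a minimal C sketch of that tail-call dispatch technique, in the spirit of the linked post. Everything here (opcode set, operand layout, names) is made up for illustration rather than being Deegen's actual API, and the guaranteed tail call uses Clang's musttail statement attribute:

    /* Minimal tail-call-dispatch sketch; opcodes, operand layout, and
       names are hypothetical, not Deegen's actual API. Build with clang
       (uses the musttail statement attribute). */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint16_t opcode; int16_t operand; } Insn;
    typedef void Handler(const Insn *pc, int64_t acc);

    static void op_addi(const Insn *pc, int64_t acc);
    static void op_halt(const Insn *pc, int64_t acc);
    static Handler *const dispatch_table[] = { op_addi, op_halt };

    /* Load the handler for the next opcode and jump to it: this is the
       movq/jmpq pair shown in the assembly above. */
    #define DISPATCH(pc, acc) \
        __attribute__((musttail)) return dispatch_table[(pc)->opcode]((pc), (acc))

    static void op_addi(const Insn *pc, int64_t acc) {
        acc += pc->operand;  /* the actual work for this bytecode */
        pc += 1;             /* advance to the next bytecode (the addq above) */
        DISPATCH(pc, acc);
    }

    static void op_halt(const Insn *pc, int64_t acc) {
        (void)pc;
        printf("result: %lld\n", (long long)acc);
    }

    int main(void) {
        Insn program[] = { {0, 20}, {0, 22}, {1, 0} };  /* 20 + 22, halt */
        dispatch_table[program[0].opcode](program, 0);  /* prints 42 */
        return 0;
    }

Because every handler is its own small function, the compiler register-allocates each one independently instead of once for a giant loop body, which is exactly the effect described above.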
You may reduce that even further by pre-decoding the bytecode: you replace each bytecode with the address of its implementation and then dispatch to it directly (with GCC's extended goto), roughly as in the sketch below.
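That would look roughly like this; a minimal sketch with hypothetical one-byte opcodes, using the GCC/Clang labels-as-values extension (classic threaded code, not something from the paper):

    /* Minimal pre-decoding (threaded code) sketch; hypothetical one-byte
       opcodes. Uses the GCC/Clang "labels as values" extension. */
    #include <stdint.h>
    #include <stdio.h>

    static int run(const uint8_t *bytecode, int n) {
        /* &&label yields the label's address (GCC extension). */
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };

        /* Pre-decode once: replace each opcode with its handler address. */
        void *threaded[64];  /* assumes n <= 64 for brevity */
        for (int i = 0; i < n; i++)
            threaded[i] = handlers[bytecode[i]];

        void **ip = threaded;
        int acc = 0;
        goto **ip++;  /* dispatch to the first instruction */

    op_inc:
        acc++;
        goto **ip++;  /* no table lookup: the "opcode" is the address */
    op_dec:
        acc--;
        goto **ip++;
    op_halt:
        return acc;
    }

    int main(void) {
        const uint8_t prog[] = { 0, 0, 0, 1, 2 };  /* inc, inc, inc, dec, halt */
        printf("%d\n", run(prog, 5));              /* prints 2 */
        return 0;
    }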
They still have register allocation issues:
> Register shuffling to fulfill C calling convention when making a runtime call.
Not sure how common that is in their benchmarks, because it's tempting to handle everything frequently used as bytecode; a rough illustration of the cost is sketched below.
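To illustrate the quoted point, a hedged sketch (hypothetical opcode and helper names, same tail-call shape as the sketch further up): any non-tail call into the runtime must respect the C calling convention, so interpreter state that was living in registers gets shuffled or spilled around the call:

    /* Rough sketch of the runtime-call cost; hypothetical opcode and
       helper names, same tail-call shape as the sketch further up. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint16_t opcode; int16_t operand; } Insn;
    typedef void Handler(const Insn *pc, int64_t acc);

    static void op_slow(const Insn *pc, int64_t acc);
    static void op_halt(const Insn *pc, int64_t acc);
    static Handler *const table[] = { op_slow, op_halt };

    /* Hypothetical out-of-line runtime helper (imagine a table lookup). */
    __attribute__((noinline)) static int64_t runtime_helper(int64_t x) {
        return x * 2;
    }

    static void op_slow(const Insn *pc, int64_t acc) {
        /* acc and pc are live across this non-tail call, so the compiler
           must move them into argument/callee-saved registers or spill
           them, per the C calling convention. */
        acc = runtime_helper(acc);
        pc += 1;
        __attribute__((musttail)) return table[pc->opcode](pc, acc);
    }

    static void op_halt(const Insn *pc, int64_t acc) {
        (void)pc;
        printf("%lld\n", (long long)acc);  /* prints 42 */
    }

    int main(void) {
        Insn program[] = { {0, 0}, {1, 0} };
        table[program[0].opcode](program, 21);
        return 0;
    }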
> Deegen’s generated interpreter is faster than LuaJIT’s handwritten one by 30%
That's exactly what I've written. But apparently I got that wrong with their baseline JIT.
"the disassembly of the Deegen-generated interpreter, baseline JIT, and the generated JIT code rivals the assembly code hand-written by assembly experts in state-of-the-art VMs."
Apparently they compare their JIT with the LuaJIT interpreter. I would be impressed if their JIT were 30% faster on average than LuaJIT in JIT mode. The Graal/Truffle-generated VMs are much faster (see e.g. http://software.rochus-keller.ch/awfy-bun-summary.ods).
> Graal/Truffle generated VMs are much faster (see e.g. [link]).
Faster than what? I don’t see any mention of any kind of Lua in that table or in the page it mentions. It’d be awesome[1] if Graal could outdo LuaJIT on Lua, and I was initially excited to learn that it did, but I don’t see anything about that there.
[1] Or as awesome as it’s possible to be for something that Oracle evidently intends to patent to the gills, anyway.
No, they also compare JIT to JIT, and theirs is 30% slower. But it's only a baseline JIT, and they'll have an optimizing one soon (tm).