Comment by mananaysiempre

10 hours ago

There’s one thing that tail calls do that no other approach to interpreters outside assembly really can, and that is decent register allocation. Current compilers only ever allocate registers one function at a time, and somehow that invariably leads them to do a bad job when given a large blob of a single interpreter function. This is especially true if you don’t isolate your cold paths into separate functions marked noinline (and preferably preserve_all or the like). Just look at the generated assembly and you’ll usually find that it sucks.

(Whether the blob uses computed gotos or loop-switch is less important these days, because Clang [but not GCC] is often smart enough to actually replicate your dispatch in the loop-switch case, avoiding the indirect branch prediction problem that in the past meant computed gotos were preferable. You do need to verify that this optimization actually happens, though, because it can be temperamental sometimes[1].)
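For anyone who hasn’t seen the two blob styles side by side, here is a minimal sketch. The three-opcode bytecode and the names run_switch and run_goto are made up for illustration; the computed-goto version relies on the GNU C labels-as-values extension, which GCC and Clang both accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy bytecode: an accumulator with three opcodes. */
enum { OP_INC, OP_DEC, OP_HALT };

/* Loop-switch: the source has a single dispatch point. Whether the
   compiler replicates it into each case has to be checked in the
   generated assembly. */
int64_t run_switch(const uint8_t *pc) {
    assert(pc != NULL);
    int64_t acc = 0;
    for (;;) {
        switch (*pc++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      return -1;  /* unknown opcode */
        }
    }
}

/* Computed goto: every opcode body ends in its own indirect jump.
   No bounds check on the opcode here, so the bytecode must be valid. */
int64_t run_goto(const uint8_t *pc) {
    static void *const labels[] = { &&do_inc, &&do_dec, &&do_halt };
    assert(pc != NULL);
    int64_t acc = 0;
#define NEXT() goto *labels[*pc++]
    NEXT();
do_inc:  acc++; NEXT();
do_dec:  acc--; NEXT();
do_halt: return acc;
#undef NEXT
}
```

Both functions should return the same result for the same valid program; the only intended difference is where the indirect branches end up.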

By contrast, tail calls with the most important interpreter variables turned into function arguments (few enough to fit into registers per the ABI; remember to use regparm or fastcall on x86-32) give the compiler the opportunity to allocate registers for each bytecode’s body separately. This usually allows it to do a much better job, even if putting the cold path out of line is still advisable. (Somehow I’ve never thought to check if it would be helpful to also mark those functions preserve_none on Clang. Seems likely that it would be.)
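A rough sketch of that shape, assuming a made-up three-opcode stack machine (the opcodes, the MUSTTAIL and DISPATCH macros, and every function name are hypothetical). The musttail attribute is a Clang extension that recent GCC also accepts; where it is missing, the macro expands to nothing and the tail call is only an optimization the compiler may or may not perform. preserve_none is left out since I haven’t measured it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use musttail where available; otherwise fall back to a plain return,
   which does NOT guarantee a tail call. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

enum { OP_PUSH, OP_ADD, OP_HALT, OP_COUNT };  /* hypothetical opcodes */
enum { STACK_SLOTS = 16 };

/* Hot interpreter state lives in arguments so it can stay in registers. */
typedef int64_t handler_fn(const uint8_t *pc, int64_t *sp, int64_t *limit);

static int64_t op_push(const uint8_t *pc, int64_t *sp, int64_t *limit);
static int64_t op_add(const uint8_t *pc, int64_t *sp, int64_t *limit);
static int64_t op_halt(const uint8_t *pc, int64_t *sp, int64_t *limit);

static handler_fn *const table[OP_COUNT] = { op_push, op_add, op_halt };

#define DISPATCH() MUSTTAIL return table[*pc](pc, sp, limit)

/* Cold path kept out of line (GCC/Clang attributes) so it does not
   cost the hot handlers any registers. */
__attribute__((noinline, cold))
static int64_t stack_overflow(void) {
    return INT64_MIN;
}

static int64_t op_push(const uint8_t *pc, int64_t *sp, int64_t *limit) {
    if (sp == limit)
        return stack_overflow();
    *sp++ = (int8_t)pc[1];  /* one-byte signed immediate */
    pc += 2;
    DISPATCH();
}

static int64_t op_add(const uint8_t *pc, int64_t *sp, int64_t *limit) {
    sp[-2] += sp[-1];       /* assumes valid bytecode: two operands present */
    sp--;
    pc += 1;
    DISPATCH();
}

static int64_t op_halt(const uint8_t *pc, int64_t *sp, int64_t *limit) {
    (void)pc; (void)limit;
    return sp[-1];          /* result is the top of the stack */
}

int64_t run(const uint8_t *code) {
    assert(code != NULL);
    int64_t stack[STACK_SLOTS];
    /* Ordinary call, so this frame (and the stack array) outlives the handlers. */
    return table[*code](code, stack, stack + STACK_SLOTS);
}
```

Each handler is its own small function with three pointer arguments, so the register allocator only has to deal with one opcode body at a time. Running the program PUSH 2, PUSH 40, ADD, HALT through run should give 42.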

[1] https://blog.nelhage.com/post/cpython-tail-call/
