Comment by haberman

10 years ago

> He then goes on to rebut other comments with simple bald assertions (like the luajit author's one) with again, no actual data.

I assure you, the LuaJIT case is real.

Here is data about LuaJIT's interpreter (in assembly) vs. Lua 5.1's interpreter (in C):

http://luajit.org/performance_x86.html

It's true that these are completely different implementations. But Lua 5.1 is one of the fastest dynamic language interpreters already. There is not room to optimize it 1.5-5x further, like you would need to catch LuaJIT's interpreter. And as the link above shows, the LuaJIT 2.0 interpreter beats Mike's own LuaJIT 1.x JIT compiler in some cases.

Mike's post made lots of specific and concrete arguments for why it's hard for C compilers to compete. Most notably, Mike's hand-written interpreter keeps all important data in registers for all fast-paths, without spilling to the stack. My experience looking at GCC output is that it is not nearly so good at this.

Look at luaV_execute() here and tell me that GCC is really going to be able to keep the variable "pc" in a register, without spilling, in all fast paths, between iterations of the loop: http://www.lua.org/source/5.1/lvm.c.html

I don't agree with the talk's overall point, but if you are skeptical about pretty much anything Mike Pall says regarding performance, you need to look harder.

4 comments

haberman

DannyBee 10 years ago

These numbers appear to be done are without any profiling data, while the hand optimized version has in fact, had profiling data guiding it (the human profiled it).

Give me numbers with profile data, and file bugs about the differences in assembly generation, and i bet it could be pretty easily fixed.

Again, we've done this before for other interpreters.

haberman 10 years ago
> Give me numbers with profile data
Because the interests me, I took a few minutes to try this out.
I ran this test on GCC 4.8.2-19ubuntu1, since it was the newest official release I could get my hands on without compiling my own GCC.
Here are my raw numbers (methodology below):
LuaJIT 2.0.2 (JIT disabled): 1.675s Lua 5.1.5: 5.787s (3.45x) Lua 5.1.5 w/FDO: 5.280s (3.15x) Lua 5.1.5 -O3: 6.536s (3.90x) Lua 5.1.5 -O3 w/FDO: 4.288s (2.56x)
For a benchmark I ran the fannkuch benchmark with N=11 (https://github.com/headius/luaj/blob/master/test/lua/perf/fa...).
My machine is a Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.
To test LuaJIT with the JIT disabled I ran:
$ time luajit -j off benchmark.lua
To test regular and FDO builds for Lua 5.1.5 I ran (in the "src" directory of a Lua 5.1.5 tree):
$ make all $ time ./lua benchmark.lua $ make clean $ make all MYCFLAGS=-fprofile-arcs MYLIBS=-fprofile-arcs $ ./lua benchmark.lua $ make clean (note: does not delete *.gcda) $ make all MYCFLAGS=-fbranch-probabilities $ time ./lua benchmark.lua
Because Lua's Makefiles use -O2 by default, I edited the Makefile to try -O3 also.
> and file bugs about the differences in assembly generation
It would be pretty hard to file bugs that specific since the two interpreters use different byte-code.
It would be an interesting exercise to write a C interpreter for the LuaJIT bytecode. That would make it easier to file the kinds of performance bugs you were mentioning.
- mikemike 10 years ago
  
  Thank you for taking the time to perform these tests!
  One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.
  You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.
  Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
  Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.
  I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.
  But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).
  And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.
  See my ramblings about preserving programmer intent, I made in the past: http://www.freelists.org/post/luajit/Ramblings-on-languages-...
  
  1 reply →