Comment by wolf550e

10 years ago

This is Mike Pall's post: http://article.gmane.org/gmane.comp.lang.lua.general/75426

Has this been fixed in GCC in the last 4 years? Can GCC 5.1 beat Mike Pall? It would be very significant if it were true. Same thing with BLAS.

I think this conversation is incomplete without considering the economics of the Mike Palls of the world. Yes, there exist experts who can write hand-tuned assembly that dramatically outperform optimizing compilers. More importantly, a well-designed algorithm can outperform an optimized version of a poorly-designed algorithm by several orders of magnitude; far beyond what we can expect from constant folding and function inlining.

You'll note that the domains to which these experts apply themselves tend to be those which affect millions of people and are critical to broad swaths of industry: everybody needs a faster BLAS implementation, and the economic impact of faster TLS implementations measures in the hundreds of millions of dollars [1].

But there is a very long tail of people writing performance-sensitive software for domains which cannot attract the time and attention of experts but for whom GCC and the JVM, for example, deliver tremendous gains.

In many cases, you might say the metric to optimize is not battery usage or throughput, but rather 'thinking time'. If a biologist is held ransom to the attention of one of a handful of experts who is capable of hand-optimizing a novel data-processing technique, that biologist will be significantly handicapped in what they can measure or explore. An optimizing compiler that can make their proof-of-concept code run "fast enough" is quite powerful.

[1] https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf

  • Saying GCC can outperform experts is incorrect. Saying that every project can afford an expert that can outperform GCC is incorrect.

    You can optimize for minimum CO2 emission and then brain power will be spent on problems that waste most CPU cycles.

    I don't think even DJB is arguing that optimizing compilers don't bring a benefit. I think DJB says that if you want a secure hypervisor/OS/browser, you should hand code the performance critical parts (e.g. bulk encryption and authentication for TLS, handshake for TLS, video decode, screen composition) and compile the rest with CompCert. If you can add more optimizations to CompCert without affecting its correctness, that's nice. But DJB does not want to run the output of GCC -O3 instead of the output of CompCert for the cold code, and he doesn't want to run the output of GCC -O3 instead of assembly by DJB, Adam Langely, Mike Pall, Jason Garrett-Glaser, etc. for the hot code.

Let's take it piece by piece:

" * Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register allocation. Nested diamond-shaped control-flow is even worse. "

Without more info, I can't possibly agree or disagree with this.

Diamonds are definitely not the worst case for register allocation, though.
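For readers unfamiliar with the term: a "diamond" in a control-flow graph is just a branch whose two arms rejoin at a single merge point. A minimal sketch in C (the function names are illustrative, not from the thread):

```c
#include <assert.h>

/* A "diamond" in the CFG: one branch, two arms, one merge point.
   Interpreter opcode handlers often nest several of these, which is
   what the quoted text claims hurts register allocation: at the merge,
   the allocator must reconcile whatever each arm left in registers. */
int abs_diff(int a, int b) {
    int r;
    if (a > b)        /* top of the diamond: the branch    */
        r = a - b;    /* left arm                          */
    else
        r = b - a;    /* right arm                         */
    return r;         /* bottom of the diamond: the merge  */
}

/* A nested diamond: a second branch inside one arm. */
int clamp(int x, int lo, int hi) {
    if (x < lo) {
        x = lo;
    } else {
        if (x > hi)
            x = hi;
    }
    return x;
}
```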

" * The compiler doesn't have enough hints to see what the fast paths and what the slow paths are. Even if you'd be able to tell it, it still sees this as a single giant control-flow graph."

With profiling info, dynamic or static, this is wrong.
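As one concrete example of a static hint: GCC's `__builtin_expect` lets the programmer mark the expected branch direction, so the compiler can make the fast path the fall-through and push the slow path out of line. A small sketch (the `fetch` helper is hypothetical):

```c
#include <assert.h>

/* Static branch hints for GCC/Clang: __builtin_expect tells the
   compiler which way a branch usually goes, even without a profile. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical interpreter-style helper: the common case is an
   in-range index; the bounds failure is the cold path, so the
   compiler can lay out the plain load as straight-line code. */
int fetch(const int *table, int n, int i) {
    if (UNLIKELY(i < 0 || i >= n))
        return -1;        /* slow path: out-of-range */
    return table[i];      /* fast path: plain load   */
}
```

Dynamic profiling (`-fprofile-generate` / `-fprofile-use`) gives the compiler the same information measured from real runs instead of guessed by hand.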

" Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths and the complex instructions kill the opportunities for the simpler instructions."

This is not accurate. First, it assumes that no optimization or alias analysis is path-sensitive, which is false.

Second, even in those cases where it is true, a smart compiler will simply insert runtime aliasing tests for what it can't prove. They do this already. A JIT (again, another type of compiler, just not static) would not even need this.
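Written out by hand, the "runtime aliasing test" idea looks like the following sketch: when the compiler can't prove two pointers are disjoint, it versions the loop behind a cheap overlap check (this is essentially what GCC's loop versioning does automatically; the function here is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a compiler-inserted runtime aliasing test: if the
   ranges [dst, dst+n) and [src, src+n) don't overlap, take the
   version of the loop the compiler is free to vectorize; otherwise
   fall back to the conservative element-by-element version. */
void add_into(int *dst, const int *src, size_t n) {
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d + n * sizeof(int) <= s || s + n * sizeof(int) <= d) {
        /* Disjoint: no cross-iteration dependence, vectorizable. */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    } else {
        /* Possible overlap: safe scalar fallback. */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }
}
```

The two loop bodies are textually identical here; the point is that only the first one carries the no-overlap guarantee the optimizer needs.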

Let's skip his hand coding and get to what he says the advantages are:

"* Keep a fixed register assignment for all instructions.

* Keep everything in registers for the fast paths. Spill/reload only in the slow paths.

* Move the slow paths elsewhere, to help with I-Cache density."

All of these things are done by optimizing compilers, already, given profiling info.
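For instance, GCC exposes the hot/cold split directly through function attributes, and infers the same split from `-fprofile-use`. A minimal sketch (function names are made up for illustration):

```c
#include <assert.h>

/* GCC's cold + noinline attributes deliver the listed advantages by
   hand: the cold function is placed in a separate section
   (.text.unlikely) to help I-cache density, its spills happen only
   on the slow path, and the hot caller keeps values in registers. */
__attribute__((cold, noinline))
static int slow_path(int x) {
    /* rarely executed: error handling, boxing, etc. */
    return -x;
}

int process(int x) {
    if (x >= 0)
        return x * 2;      /* fast path: inline, register-resident */
    return slow_path(x);   /* slow path: outlined, out of the way  */
}
```

With feedback-directed optimization (`gcc -fprofile-generate`, run the workload, then `gcc -fprofile-use`), GCC applies the same layout decisions without any annotations.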

If his statement about what he thinks the advantages are is accurate, and matches what we do, I'm 100% positive that GCC + FDO either already does this, or could be tuned without a ton of work, to beat Mike Pall for this kind of loop.

  • > If his statement about what he thinks the advantages are is accurate, and matches what we do, I'm 100% positive that GCC + FDO either already does this, or could be tuned without a ton of work, to beat Mike Pall for this kind of loop.

    I love the gauntlet-throwing fire in this statement, and I only wish that it was practical to arrange for a showdown on this to actually happen. Especially since it would do nothing but improve the state of the art on both sides, and be a great case study in the capabilities and limitations of compilers.

    Unfortunately I think it is impractical to actually have an apples-to-apples showdown because of the fact that the two interpreters use different byte-code. Writing a pure-C interpreter for LuaJIT would be non-trivial.

    • > I love the gauntlet-throwing fire in this statement, and I only wish that it was practical to arrange for a showdown on this to actually happen.

      Yes, but it's already available in a simpler case, as eluded to above and in discussion on the preview of this talk. There wasn't an answer to the question about which compiler is competitive with OpenBLAS on the linpack benchmark in Fortran (dominated by the triply-nested matrix multiplication loop).

      A typical computational chemistry program of the type that consumes many cycles on HPC systems might use three types of kernel in assembler for good reason, e.g. OpenBLAS, FFTW, and ELPA (scalable eigenvalues). This is unfortunate, but at least the work of a few people can deliver a large benefit; many thanks to them.
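The "triply-nested matrix multiplication loop" in question is the textbook loop nest below; this is what a compiler must tile, vectorize, and unroll to approach what OpenBLAS achieves with hand-written kernels. A plain C sketch (row-major, written here for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Naive triply-nested matmul: C = A * B, where A is m x k,
   B is k x n, C is m x n, all row-major. The innermost loop is
   a dot product; the whole nest is the benchmark kernel the
   thread is arguing about. */
void matmul(double *C, const double *A, const double *B,
            size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}
```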

  • I would love to see the results. I suggest contacting him either directly or through the mailing list for a representative sample of a C interpreter that can be benchmarked against luajit. http://luajit.org/contact.html

    • To be frank, if it wasn't interesting enough for him to ever file a bug against a compiler, and he's already got something he likes, I don't see why I should take the time to produce results. I mean, what would the point be? He's obviously happy with what he's got (and more power to him).

      Even if I did it, outside of supporting an argument on the internet, what changes?
