Comment by azag0

9 years ago

How does this go with the often quoted mantra that you can only beat compilers today if you're an extremely skilled asm programmer? Or is the problem you describe just about executable size rather than speed?

Optimizing for size is easier because you only have exactly one metric to consider: how many bytes your instructions take.

When optimizing for speed you have to consider many factors like the relative speed of each instruction, cache behavior (including size of the cachelines, associativity, number of layers, relative speed of the layers...), pipelining, branch prediction, prefetching, whether moving your data to SIMD registers could be worth it, what to inline and what not to inline, what to unroll and what not to unroll, constraint solving to optimize things that can be computed or asserted statically etc...

Well, the code wasn't compiled by today's compiler, it was compiled in late 2000. Visual Studio 6 maybe?

Even today compilers tend not to optimise the function preamble/postamble away. I'm only half in agreement with the mantra: you probably can beat the compiler, but is it worth it?

There are a few situations where it's genuinely a good idea to write in assembler to be explicit about predictable behaviour. Short security-critical constant-time functions are a good candidate.
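To make the constant-time point concrete, here is a minimal sketch in C of the kind of short function meant here. The name `ct_equal` is made up for illustration. Note that the C standard promises nothing about timing, so a compiler is free to turn this back into an early-exit loop, which is exactly the argument for writing such routines in assembler.

```c
#include <stddef.h>

/* Hypothetical helper: compare two buffers without an early exit.
 * Returns 1 if all n bytes match, 0 otherwise. The loop always runs
 * n iterations, so the source code has no data-dependent branch.
 * This is NOT a guarantee about the generated machine code. */
int ct_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++) {
        diff |= (unsigned char)(a[i] ^ b[i]); /* accumulate any differing bits */
    }
    return diff == 0;
}
```

Anyone relying on this would still need to inspect the generated assembly for each compiler and optimisation level.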

There are a lot of assembly language instructions that do slightly different things than standard C++ or C, but if the programmer is aware of them they can "handle" the differences.

For example, the xchg instruction doesn't have any C equivalent. (although it has a C++ equivalent: std::swap) The programmer may see:

    A ^= B; B ^= A; A ^= B

That sequence swaps A and B. A C compiler may be smart enough to recognize the swap and emit an xchg instruction, or it might emit the three xors literally. Hard to say, really.
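For comparison, a small sketch of the XOR swap next to a plain swap through a temporary. The function names are made up for illustration. Which instructions come out (xchg, xors, or just register renaming) depends on the compiler and flags, so nothing here says what any particular compiler emits. One thing the XOR version needs that the temporary version does not: a guard for the case where both pointers refer to the same object, because x ^= x zeroes the value.

```c
#include <stdint.h>

/* XOR swap as in the snippet above, written against pointers.
 * If a and b alias, the first XOR would zero the value, so that
 * case is skipped explicitly. */
void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a == b) {
        return;
    }
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}

/* Plain swap through a temporary; works even when a == b and leaves
 * the choice of instructions entirely to the compiler. */
void tmp_swap(uint32_t *a, uint32_t *b)
{
    uint32_t t = *a;
    *a = *b;
    *b = t;
}
```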

---------------

Most of the low-hanging fruit has been taken up for sure. Almost every "memcpy" turns into "rep movs" for example (which is the assembly-language equivalent of memcpy; "rep stos" is the memset counterpart).

A high-level programmer may not know that "memcpy" turns into "rep movs" however, and may write their own memory-copying for-loop.

At the very least, a good optimizing C / C++ programmer needs to know about these little things. They'll let the compiler turn "memcpy" into "rep movs" (for -Os) or AVX load/store instructions (when optimizing for speed) instead of writing their own less efficient loops.
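A minimal sketch of the two approaches being contrasted, with made-up function names. Both copy the same bytes; the difference is only in how much the compiler is told. With the library call, the compiler knows the intent directly and can pick whatever copy strategy suits the target and flags. With the hand-written loop, it first has to recognise the idiom, which modern compilers often do but are not obliged to.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written byte loop: correct, but the compiler has to work out
 * for itself that this is a memory copy. */
void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Library call: states the intent, so the compiler may inline it or
 * call the tuned library routine. Buffers must not overlap. */
void copy_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Compiling both with something like -Os and -O2 and reading the assembly is the quickest way to see what a given compiler actually does with each.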

Optimising for size is a relatively "obvious" goal, although it still takes a lot of skill to do it well. Optimising for speed is much less obvious, however: the x86 architecture is incredibly complex when it comes to working out what code will be faster.

Well, it should also be noted that the responsible compiler in this case is at least 17 years old.

> How does this go with the often quoted mantra that you can only beat compilers today if you're an extremely skilled asm programmer? Or is the problem you describe just about executable size rather than speed?

Word 2000, so a 17+ year old compiler. Compilers have gotten a lot better since then.

Having worked on a compiler team back in the mid 2000s, even then I'd say it was easy for almost anyone to spot areas where a human could optimize more.

Nowadays, much less so.

This is a case of knowing the rules so well that you know when you can break them.

It's also a historical artifact from the days when many programmers still wrote assembly but compilers were starting to get good.

There’s also an element of avoiding premature optimization: don’t assume the compiler will produce slower code or, if it does, that it will matter in your specific application.

At the very least you should give the compiler a chance, profile, then hand-tune only after you’ve picked all the low-hanging fruit.

That mantra applies to "most" programmers.

I think he was talking mostly about size.

Odds are good most programmers tinkering in machine code won't beat the performance of the compiler. That takes experience. It is a good rule of thumb.

I think it is easier to write smaller (size) code than a compiler, but when you measure performance it will beat you often until you get good. Alignment, x86 tricks... It takes a bit of knowledge to do well.

simias has most of it but note also that that file appears to have last been compiled in the early 2000s. Compilers of that era were far less advanced, especially since many large companies were pretty conservative about the optimizations enabled (fixing a bug meant mailing CDs for many customers).

The general trend is that it's been getting harder and harder to do that easily, which means people want to be more focused — something like OpenSSL can still justify hand-tuned assembly for various processor families because it's a widespread hotspot but as compilers continue to improve the number of places where it's worth the maintenance cost is going to keep shrinking.

In the early 2000s, the scientific HPC programmers I worked with were careful to maintain a portable C implementation which they could use as a check both for correctness and for an optimization baseline — it wasn't uncommon for a new compiler and/or processor to substantially close the gap relative to a lot of hard manual work.
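A toy sketch of that workflow, with made-up names: keep a plain portable version as the reference, and check any hand-tuned variant against it. The "tuned" version here is just a 4-way unrolled integer sum standing in for real hand-optimised code; it is not taken from any actual HPC codebase.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference: the obviously-correct baseline. */
uint64_t sum_ref(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Stand-in for a hand-tuned version: four independent accumulators,
 * then a tail loop for the remaining 0..3 elements. */
uint64_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) {
        s0 += v[i];
    }
    return s0 + s1 + s2 + s3;
}
```

The same pair doubles as the optimisation baseline: time both on each new compiler or processor, and drop the tuned version if the gap has closed.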