Comment by userbinator

9 years ago

Notice the xchg, stosb and a loop instruction. This was definitely written by a skilled Asm programmer --- I've never seen even a compiler at -Os generate code like that.

This also compels me to "code-golf" the function even more:

     push edi
     mov edi, [esp+8]
     mov ecx, [esp+12]
     jecxz label2
    label1:
     push ecx
     call sub_416352
     stosb
     pop ecx
     test al, al
     loopnz label1
     jecxz label2
     dec edi
     salc
     stosb
    label2:
     pop edi
     ret

Original: 58 bytes; patched: 44; mine: 30.

I've done plenty of patching like this, and indeed the relative "sparseness" of compiler output very often allows the more functional version to be smaller than the original. It's amazing how many instructions the original wastes --- notice how none of ebx, esi, or edi are used, yet they get needlessly pushed and popped; and despite saving those registers so they could be used locally, the compiler perplexingly decided to keep all the local variables on the stack instead. The "jump around a jump", with both of them being the "long" form (for destinations greater than 128 bytes away, not the case here) is equally horrible. This may actually be a case where today's compilers will generate smaller code for the same source.

Note that in 32-bit code, memcpy is typically implemented by first copying blocks of 4 bytes using the movsd (move double word) instruction, while any remaining bytes are then copied using movsb (move byte). This is efficient in terms of performance, but whoever was patching this noticed that some space can be freed by only using movsb, and perhaps sacrificing a nanosecond or two.

On older processors this was true, but since Ivy Bridge a REP MOVSB will essentially be as fast but smaller. Look up "enhanced REP MOVSB" for more information.

Modern compilers aren't that much better at code golf.

I tried equivalent C code and gcc-7.2 gets me 47 bytes, while clang-6.0 only manages 49 bytes (both with -m32 f.c -Os -fomit-frame-pointer).

I have a feeling that size optimization just isn't really important to (at least open source) compiler writers these days. There are more important things, like actual performance, standards compliance and nice diagnostics.

  • It’s intriguing to me that this is so, given that most of the point of writing native code these days (rather than targeting some managed-native system like the CLR) is to optimize hot loops, and one of the best optimizations for hot loops is to get them to fit entirely into a cache line. Do compilers for C/FORTRAN/etc. not have any mode or pragma to indicate that you’re attempting to get this performance benefit from a given function-and-its-dependencies?

How does this go with the often quoted mantra that you can only beat compilers today if you're an extremely skilled asm programmer? Or is the problem you describe just about executable size rather than speed?

  • Optimizing for size is easier because you only have exactly one metric to consider: how many bytes your instructions take.

    When optimizing for speed you have to consider many factors like the relative speed of each instruction, cache behavior (including size of the cachelines, associativity, number of layers, relative speed of the layers...), pipelining, branch prediction, prefetching, whether moving your data to SIMD registers could be worth it, what to inline and what not to inline, what to unroll and what not to unroll, constraint solving to optimize things that can be computed or asserted statically etc...

  • Well, the code wasn't compiled by today's compiler, it was compiled in late 2000. Visual Studio 6 maybe?

    Even today compilers tend not to optimise the function preamble/postamble away. I'm only half in agreement with the mantra: you probably can beat the compiler, but is it worth it?

    There are a few situations where it's genuinely a good idea to write in assembler to be explicit about predictable behaviour. Short security-critical constant-time functions are a good candidate.

  • There are a lot of assembly language instructions that do slightly different things than standard C++ or C, but if the programmer is aware of them they can "handle" the differences.

    For example, the xchg instruction doesn't have any C equivalent. (although it has a C++ equivalent: std::swap) The programmer may see:

        A ^= B; B ^= A; A ^= B
    

    These two are swapped. A C compiler may be smart enough to know this is an xchg instruction, or it might turn them into xors. Hard to say, really.

    ---------------

    Most of the low hanging fruit have been taken up for sure. Almost every "memcpy" turns into "rep stos" for example (which is the assembly-language equivalent to memcpy).

    A high-level programmer may not know that "memcpy" turns into "rep stos" however, and may emit his own memory copying for-loop.

    At very least, a good optimizing C / C++ programmer needs to know about these little things. They'll let the compiler turn "memcpy" into "rep stos" (for -Os) or AVX memory store instructions respectively instead of writing their own less efficient loops on the matter.

  • Optimising for size is a relatively "obvious" goal, although it still takes a lot of skill to do it well. Optimising for speed is much less obvious however, the x86 architecture is incredibly complex when it comes to working out what code will be faster.

  • Well, it should also be noted that the responsible compiler in this case is at least 17 years old.

  • > How does this go with the often quoted mantra that you can only beat compilers today if you're an extremely skilled asm programmer? Or is the problem you describe just about executable size rather than speed?

    Word 2000, so a 17+ year old compiler. Compilers have gotten a lot better since then.

    Having worked on a compiler team back in the mid 2000s, even then I'd say it was easy for almost anyone to spot areas where a human could optimize more.

    Now days, much less so.

  • This is a case of knowing the rules so well that you know when you can break them.

    Its also a historical artifact from the days when many programmers wrote assembly yet compilers started getting good.

    There’s also an element of avoiding premature optimization: don’t assume the compiler will product slower code or that if it does it will matter in your specific application.

    At the very least you should give the compiler a chance, profile, then hand-tune after you’ve fixed all the low-hanging fruit.

  • That mantra applies to "most" programmers.

    I think he was talking mostly about size.

    Odds are good most programmers tinkering in machine code won't beat the performance of the compiler. That takes experience. It is a good rule of thumb.

    I think it is easier to write smaller (size) code than a compiler, but when you measure performance it will beat you often until you get good. Alignment, x86 tricks... It takes a bit of knowledge to do well.

  • simias has most of it but note also that that file appears to have last been compiled in the early 2000s. Compilers of that era were far less advanced, especially since many large companies were pretty conservative about the optimizations enabled (fixing a bug meant mailing CDs for many customers).

    The general trend is that it's been getting harder and harder to do that easily, which means people want to be more focused — something like OpenSSL can still justify hand-tuned assembly for various processor families because it's a widespread hotspot but as compilers continue to improve the number of places where it's worth the maintenance cost is going to keep shrinking.

    In the early 2000s, the scientific HPC programmers I worked with were careful to maintain a portable C implementation which they could use as a check both for correctness and for an optimization baseline — it wasn't uncommon for a new compiler and/or processor to substantially close the gap relative to a lot of hard manual work.

"Note that in 32-bit code, memcpy is typically implemented by first copying blocks of 4 bytes using the movsd (move double word) instruction, while any remaining bytes are then copied using movsb (move byte)."

Some software authors do not use memcpy().

https://marc.info/?l=djbdns&m=96477313901746&w=2

http://cr.yp.to/lib/byte.html

   #include "byte.h"

   void byte_copy(to,n,from)
   register char *to;
   register unsigned int n;
   register char *from;
   {
     for (;;) {
       if (!n) return; *to++ = *from++; --n;
       if (!n) return; *to++ = *from++; --n;
       if (!n) return; *to++ = *from++; --n;
       if (!n) return; *to++ = *from++; --n;
     }
   }

In fact, REP MOVSD is harder to handle in hardware especially for unaligned locations because you can only interrupt on 4 byte boundries.