Comment by imglorp

1 year ago

Asm is 10x faster than C? That was definitely true at some point but is it still true today? Have compilers really stagnated so badly they can't come close to hand coded asm?

C with intrinsics can get very close to straight assembly performance. The FFmpeg devs are somewhat infamously against intrinsics (IIRC they don't allow them in their codebase even if the performance is as good as equivalent assembly) but even by TFA's own estimates the difference between intrinsics and assembly is on the order of 10-15%.

You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that, which is often, because auto-vectorization still mostly sucks beyond trivial cases. It's not really a surprise that expert code runs circles around naive code though.

  • > You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that,

    I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more, but that can be done with ISPC, it doesn't need to be done with asm.

    • > I can get far more than 10x over naive C

      However you can write better than naive C by compiling and watching the compiler output.

      I stopped writing assembly back around y2k as I was fairly consistently getting beaten by the compiler when I wrote compiler-friendly high-level code. Memory organization is also something you can control fairly well on the high-level code side too.

      Sure some niches remained, but for my projects the gains were very modest compared to invested time.
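To illustrate the memory-access reordering mentioned in this subthread, here is a minimal sketch in plain C. The function names and the flat row-major layout are assumptions made for the example, not anything from the commenters' code. Both functions compute the same sum; only the traversal order differs, and how large the speed gap is depends on matrix size and cache hierarchy, so it should be measured rather than assumed.

```c
#include <stddef.h>

/* Matrix stored row-major in a flat array: element (r, c) is m[r * cols + c]. */

/* Column-first traversal: consecutive accesses are `cols` elements apart,
   so each fetched cache line is mostly unused before it is evicted. */
long sum_column_first(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}

/* Row-first traversal: accesses are contiguous in memory, which is
   friendlier to caches, hardware prefetchers, and auto-vectorizers. */
long sum_row_first(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}
```

Timing the two on a matrix much larger than the last-level cache is the usual way to see the difference.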

  • "The FFmpeg devs are somewhat infamously against intrinsics (they don't allow them in their codebase even if the performance is as good as equivalent assembly)"

    Why?

    • I don't know if it's their reason but I myself avoid them because I find them harder to read than assembly language.

    • Did you read lesson one?

      TL;DR They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics code and assembly is not worth the trade off when doing 100% assembly offers better performance guarantees, readability, and ease of maintenance / onboarding of developers.

It's not a matter of compiler stagnation. The compiler simply isn't privy to the information the assembly author makes use of to inform their design.

Put more simply: a C compiler can't infer from a plain C implementation that you're trying to do certain mathematics that could alternately be expressed more efficiently with SIMD intrinsics. It doesn't have access to your knowledge about the mathematics you're trying to do.

There are also target specific considerations. A compiler is, necessarily, a general purpose compiler. Problems like resource (e.g. register) allocation are NP-complete (equivalent to knapsack) and very few people want their compiler to spend hours upon hours searching for the absolute most optimal (if indeed you can even know that statically...) asmgen.
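As a sketch of the point about knowledge the compiler may not have: a sum of absolute differences over 16 bytes (a common building block in video code) maps onto a single SSE2 instruction, `_mm_sad_epu8`. The function names here are hypothetical, and whether a given compiler recognizes the scalar loop as this idiom varies, so treat this as an illustration rather than a claim about any particular compiler. The intrinsic path is guarded so the code still builds on non-x86 targets.

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C: sum of absolute differences over 16 bytes. */
unsigned sad16_scalar(const uint8_t *a, const uint8_t *b) {
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Same result, expressed with the SSE2 intrinsic when available. */
unsigned sad16_simd(const uint8_t *a, const uint8_t *b) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Produces two partial sums, one in each 64-bit half of the register. */
    __m128i s = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(s) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
#else
    /* Fallback for targets without SSE2. */
    return sad16_scalar(a, b);
#endif
}
```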

This is for heavily vectorized code, using every hack possible to fully utilize the CPU. Compilers are smart when it comes to normal code, but codecs are not really normal code. Not an FFmpeg programmer, but have some background dealing with audio.

  • > codecs are not really normal code.

    Not really a fair comment. They are entirely normal code in most senses. They differ in one important way: they are (frequently) perfect examples of where "single instruction, multiple data" completely makes sense. "Do this to every sample" is the order of the day, and that is a bit odd when compared with text processing or numerical computation.

    But this is true of the majority of signal processing, not just codecs. As simple a thing as increasing the volume of an audio data stream means multiplying every sample by the same value - more or less the definition of SIMD.
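The volume example above, written as a minimal C sketch (the function name and signature are assumptions for illustration). The loop applies the same multiply to every sample, and the `restrict` qualifiers tell the compiler the buffers do not overlap, which is the kind of loop optimizing compilers can often vectorize on their own at higher optimization levels.

```c
#include <stddef.h>

/* Multiply every sample by the same gain: one operation, many data elements. */
void scale_samples(float *restrict out, const float *restrict in,
                   size_t n, float gain) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```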

    • There's a difference because audio processing is often "massively parallel", or at least like 1024 samples at once, but in video codecs operations could be only 4 pixels at once and you have to stretch to find extra things to feed the SIMD operations.

  • > codecs are not really normal code.

    Codecs are pretty normal code. You can get decent performance by just writing quality idiomatic C or C++, even without asm. (I implemented a commercial H.264 codec and worked on a bunch of audio codecs.)

C compilers are still pretty bad at auto vectorization. For problems where SIMD is applicable, you can reasonably expect a 2x-16x speed up over the naive scalar implementation.

  • Also, if you write code with intrinsics, autovectorization can make it _worse_. E.g. a common pattern is to write a SIMD main loop followed by a scalar tail, but the compiler can autovectorize that tail and mess it up.

    • Given the wider availability of masking (AVX-512, RISC-V and SVE), I figure scalar tails are no longer the preferred pattern everywhere.
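For readers unfamiliar with the pattern discussed in this subthread, here is a minimal sketch of a SIMD main loop followed by a scalar tail, using SSE intrinsics where available. The function name is hypothetical. The masked-tail alternative mentioned in the reply (AVX-512, RISC-V, SVE) is not shown.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    /* Main loop: four floats per iteration using unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail: the 0 to 3 leftover elements (or everything, without SSE).
       This is the loop a compiler may decide to vectorize again by itself. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```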

Probably some very niche things. I know I can't write ASM that's 10x better than C, but I wouldn't assume no one can.

  • It isn't very hard to write C that is 10x better than C, because most programs have too many memory allocations and terrible memory access patterns. Once you sort that out you are already more than 10x ahead, then you can turn on the juice with SIMD, parallelization and possibly optimize for memory bandwidth as well.

  • It depends on what you're trying to do. I would in general only expect such substantial speedups when considering writing computation kernels (for audio, video, etc).

    Compilers today are liable in most circumstances to know many more tricks than you do. Especially if you make use of hints (e.g. "this memory is almost always accessed sequentially", "this branch is almost never taken", etc) to guide it.

    • Oh I definitely agree that in the vast majority of cases the compiler will probably win.

      But I suspect there are cases where the super experts exist who can do things better.

    • Mm, those hints don't matter on modern CPUs. There's no good way for the compiler to pass it down to them either. There are some things like prefetch instructions, but unless you know the exact machine you're targeting, you won't know when to use them.
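For reference, this is roughly what the hints debated above look like with GCC or Clang builtins. It is a sketch only: the macro names, the function, and the prefetch distance of 16 elements are made up for the example, and, as the replies note, whether such hints help on a given CPU is something to measure rather than assume.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0) /* "this branch is almost never taken" */
#define PREFETCH(p) __builtin_prefetch(p)      /* request a cache line ahead of use */
#else
#define UNLIKELY(x) (x)
#define PREFETCH(p) ((void)0)
#endif

/* Sums the values in v; returns -1 if a negative value is found,
   which is assumed here to be a rare error path. */
long sum_checked(const int *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            PREFETCH(&v[i + 16]);  /* arbitrary distance, for illustration */
        if (UNLIKELY(v[i] < 0))
            return -1;
        total += v[i];
    }
    return total;
}
```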

I highly doubt it's true. I can usually approach the same speed in C if I'm working with a familiar compiler. Sometimes I can do significantly better in assembly but it's rare.

I work on bare metal embedded systems though, so maybe there's some nuance when working with bigger OS libs?

  • The difference is probably that you don’t work in an environment that supports SIMD or your code can’t benefit from it.

    • You're correct, I don't use SIMD instructions much, but I can, and with a C compiler. So I'm still not sure what the advantage of ASM is.

This gets even more complex once you start looking at dynamic compilations. Some of the JIT compilers have the ability to hot patch functions based upon runtime statistics. In very large, enterprisey applications with unknowns regarding how they will actually be used at build time, this can make a difference.

You can go nuclear option with your static compilations and turn on all the optimizations everywhere, but this kills inner loop iteration speed. I believe there are aspects of some dynamic compiling runtimes that can make them superior to static compilations - even if we don't care how long the build takes.

  • Statistics aren't magic and it's not going to find superoptimizing cases like this by using them. I think this is only helpful when you get a lot of incoming poorly written/dynamic code needing a lot of inlining, that maybe just got generated in the first place. So basically serving ads on websites.

    In ffmpeg's case you can just always do the correct thing.

I remember a series of lectures from an Intel engineer that went into how difficult it was writing assembly code for x86. He basically stated that the number of cases you can really write code that is faster than what a compiler would do is close to none.

Essentially people think they are writing low level code, in reality that's not how CPUs interpret that code, so he explained how writing manual assembly kills performance pretty much always (at least on modern x86).

  • That's for random "I know asm so it must be faster".

    If you know it really well, have already optimized everything on an algorithmic level and have code that can benefit from SIMD, 10x is real.

  • You have to consider that modern CPUs don't execute code in-order, but speculatively, in multiple instruction pipelines.

    I've used Intel's icc compiler and profiler tools in an iterative fashion. A compiler like Intel's might be made to profile cache misses, pipeline utilization, branches, stalls, and supposedly improve in the next compilation.

    The assembly programmer has to consider those factors. Sure would be nice to have a computer check those things!

    In the old days, we only worried about cycle counts, wait states, and number of instructions.

  • That's assembly by people who learned it in 1990. Intel very much does want you writing assembly for their processors and in many ways the only way to push them hard is by doing so.

No, that claim is ridiculous. When doing the same task, quite frankly, compilers are much better than any human at optimizing general logic.

But when the human and compiler are not faced with the same problem...

Say, if your compiler doesn't support autovectorization and/or your C code isn't friendly to the idiom, then sure: a 10x difference in performance between a hand-optimized SIMD implementation and a naive scalar one fed to a C compiler is probably about right.