Comment by kierank

1 year ago

I am the author of these lessons.

Ask me anything.

As a user of an ARM Mac, I wonder: how much effort does it take to get such optimized code to work the same in all platforms? I guess you must have very thorough tests and fallback algorithms?

If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

  • > If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

    Not ported, but rather re-implemented. So: yes.

    A bit more detail: during build, on x86, the FFmpeg binary would include hand-written AVX2 (and SSSE3, and AVX512, etc.) implementations of CPU-intensive functions, and on Arm, the FFmpeg binary would include hand-written Neon implementations (and a bunch of extensions; e.g. dotprod) instead.

    At runtime (when you start the FFmpeg binary), FFmpeg "asks" the CPU what instruction sets it supports. Each component (decoder, encoder, etc.) - when used - will then set function pointers (for CPU-intensive tasks) which are initialized to a C version, and these are updated to the Neon or AVX2 version depending on what's included in the build and supported by this specific device.

    So in practice, all CPU-intensive tasks for components in use will run hand-written Neon code for you, and hand-written AVX2 for me. For people on obscure devices, it will run the regular C fallback.

  • While the instructions are different, every platform will have some implementation of the basic operations (load, store, broadcast, etc.), perhaps with a different bit width. With those you can write an accelerated baseline implementation, typically (sometimes these are autogenerated/use some sort of portable intrinsics, but usually they don't). If you want to go past that then things get more complicated and you will have specialized algorithms for what is available.

Hi, thanks for your work!

I have a question, as someone who can just about read assembly but still do not intuitively understand how to write or decompose ideas to utilise assembly, do you have any suggestions to learn / improve this?

As in, at what point would someone realise this thing can be sped up by using assembly? If one found a function that would be really performant in assembly how do you go about writing it? Would you take the output from a compiler that's been converted to assembly or would you start from scratch? Does it even matter?

  • You're looking for the tiniest blocks of code that are run an exceptional number of times.

    For instance, I used to work on graphics renderers. You'd find the bit that was called the most (writing lines of pixels to the screen) and try to jiggle the order of the instructions to decrease the number of cycles used to move X bits from system RAM to graphics RAM.

    When I was doing it, branching (usually checking an exit condition on a loop) was the biggest performance killer. The CPU couldn't queue up instructions past the check because it didn't know whether it was going to go true or false until it got there.

    • Don’t modern or even just not ancient cpus use branch prediction to work past a check knowing that the vast majority of the time the check yields the same result?

      2 replies →

  • The best answer to your question is some variant of "write more assembly".

    When someone indicates to me they want to learn programming for example, I ask them how many programs they've written. The answer is usually zero, and in fact I've never even heard greater than 10. No one will answer a larger number because that selects out people who would even ask the question. If you write 1000 programs that solve real problems, you'll be at least okay. 10k and you'll be pretty damn good. 100k and you might be better than the guy who wrote the assembly manual.

    For a fun answer, this is a $20 nand2tetris-esque game that holds your hand through creating multiple cpu architectures from scratch with verification (similarly to prolog/vhdl), plus your own assembly language. I admittedly always end up writing an assembler outside of the game that copies to my clipboard, but I'm pretty fussy about ux and prefer my normal tools.

    https://store.steampowered.com/app/1444480/Turing_Complete/

  • This is one heck of a question.

    I don't know assembly, but my advice would be to take the rote route by rewriting stuff in assembly.

    Just like anything else, there's no quick path to the finish line (unless you're exceptionally gifted), so putting in time is always the best action to take.

What's your perspective on variable-width SIMD instruction sets (like ARM SVE or the RISC-V V extension)? How does developer ergonomics and code performance compare to traditional SIMD? Are we approaching a world with fewer different SIMD instruction sets to program for?

  • Var-width SIMD can mostly be written using the exact same Highway code, we just have to be careful to avoid things like arrays of vectors and sizeof(vector).

    It can be more complicated to write things which are vector-length dependent, such as sorting networks or transposes, but we have always found a way so far.

    On the contrary, there are increasing numbers of ISAs, including the two LoongArch LSX/LASX, AVX-512 which is really really good on Zen5, and three versions of Arm SVE. RISC-V V also has lots of variants and extensions. In such a world, I would not want to have to implement per-platform implementations.

How does FFmpeg generate SEH tables for assembly functions on Windows? Is this something that x86asm.inc handles, or do you guys just not worry about it?

As someone who wrote x86 optimization code professionally in the 90s, do we need to do this manually still in 2025?

Can we not just write tests and have some LLM try 10,000 different algorithms and profile the results?

Or is an LLM unlikely to find the optimal solution even with 10,000 random seeds?

Just asking. Optimizing x86 by hand isn't the easiest, because to think through it you start to have to try and fit all the registers in your mind and work through the combinations. Also you need to know how long each instruction combination will take; and some of these instructions have weird edge cases that take vastly longer or quicker to run that is hard for a human to take into account.

  • I guess your question could be rephrased as "couldn't we come up with better compilers?" (LLM-based or not, brute force based or not).

    I don't have an answer but I believe that a lot of effort has been put in making (very smart) compilers already, so if it's even possible I doubt it's easy.

    I also believe there are some cases where it's simply not possible for a compiler to beat handwritten assembly : indeed there is only so much info you can convey in a C program, and a developer who's aware of the whole program's behavior might be able to make extra assumptions (not written in the C code) and therefore beat a compiler. I'm sure people here would be able to come up with great practical examples of this.

  • While using a LLM might not be the best approach, it would be interesting to know if there are some tools these days that can automate this.

    Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

    One potential pitfall I can see is that there are a lot of non-local interactions in moderns systems. We have large out-of-order buffers, many caching layers, complex branch predictors, and an OS running other tasks at the same time, and a dozen other things.

    What is optimal on paper might not be optimal in the real world.

    • > Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

      There are optimization libraries which can find the optimum combination of parameters for an objective, like Optuna.

      It would be enough to expose all the optimization knobs that LLVM has, and Optuna will find the optimum for a particular piece of code on a particular test payload.

  • I have tried with Grok3 and Claude. They both seem to have an understanding of the algorithms and data patterns which is more than I expected but then just guess a solution that's often nonsensical.

  • You would need to be very careful about verifying the output. Having an LLM generate patterns and then running them through a SAT solver might work, but usually it's only really feasible for short sequences of code.