Comment by computerbuster

1 year ago

Another resource on the same topic: https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...

As I'm seeing in the comments here, the usefulness of handwritten SIMD ranges from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical", so I'll talk a bit about that.

FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.

dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.

While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.

I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.

One of the fun things about dav1d is that since it’s written in assembly, they can use their own calling convention. And it can differ from method to method, so they have very few stack stores and loads compared to what a compiler will generate following normal platform calling conventions.

  • I'm curious why there are even function calls in time-critical code, shouldn't just about everything be inlined there? And if it's not time-critical, why are we interested in the savings from a custom calling convention?

    • Binary size was a concern, so excessive inlining was undesirable.

      And don't forget that any asm-optimized variant always has a C fallback for generic platforms lacking a hand-optimized variant which is also used to verify the asm-optimized variant using checkasm. This might not be linked into your binary/library (the linker eliminated it because it's never used), but the code exists nonetheless.

      1 reply →

    • Function calls are very fast (unless there's really a lot of parameter copying/saving-to-stack) and if you can re-use a chunk of code from multiple places, you'll reduce pressure on the instruction cache. Inlining is not always ideal.

      1 reply →

    • Codecs often have many redundant ways of doing the same thing, which are chosen on the basis of which one uses the fewest bits, for a specific piece of data. So you can't inline them as you don't know ahead of time which will be used.

  • Doesn’t this just make it harder to maintain ports to other architectures though?

    • For what's written in assembly, lack of portability is a given. The only exceptions would presumably be high level entry points called to from C, etc. If you wanted to support multiple targets, you have completely separate assembly modules for each architecture at least. You'd even need to bifurcate further for each simd generation (within x64 for example).

    • Yes, but on projects like that, ease of maintenance is a secondary priority when compared to performance or throughput.

    • There indeed have been bugs caused by amd64 assembly code assuming unix calling convention being used for Windows builds and causing data corruption. You have to be careful.

I'm also in the mission-critical camp, with perhaps an interesting counterpoint. If we're focusing on small details (or drowning in incidental complexity), it can be harder to see algorithmic optimizations. Or the friction of changing huge amounts of per-platform code can prevent us from escaping a local minimum.

Example: our new matmul outperforms a well-known library for LLM inference, sometimes even if it uses AMX vs our AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; hard to tell with a JIT involved.

This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.

Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.

  • > I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.

    It should be obvious that both are pursued independently whenever it makes sense. The idea that one should precede the other or is more important than the other is simply untrue.

    • How can tuning be independent of devising the algorithm?

      Are you really suggesting writing a variant of a kernel, tuning it to the max, then discovering a new and different way to do it, and then discarding the first implementation? That seems like a lot of wasted effort.

What does Zig offer in the way of builtin SIMD support, beyond overloads for trivial arithmetic operations? 90% of the utility of SIMD is outside of those types of simple operations. I like Zig, but my understanding is you have to reach for CPU specific builtins for the vast majority of cases, just like in C/C++.

GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.

  • Zig ships LLVM's internal generic SIMD stuff, which is fairly common for newish systems languages. If you want dynamic shuffles or even moderately exotic things like maddubs or aesenc then you need to use LLVM intrinsics for specific instructions or asm.

  • I’m also wondering what “built in” even means. Many have SIMD, Vector, Matrix, Quaternions and the like as part of the standard library, but not necessarily as their own keywords. C#/.NET, Java has SIMD by this metric.

    • Java's Panama Vectors are work in progress and are far from being competitive with .NET's implementation of SIMD abstractions, which is mostly on par with Zig, Swift and Mojo.

      You can usually port existing SIMD algorithms from C/C++/Rust to C# with few changes retaining the same performance, and it's practically impossible to do so in Java.

      I feel like C veterans often don't realize how unnecessarily ceremonious platform-specific SIMD code is given the progress in portable abstractions. Unless you need an exotic instruction that does not translate across architectures and/or common patterns nicely, there is little reason to have a bespoke platform-specific path.

      5 replies →

So on point. We do _a lot_ of hand written SIMD on the other side (encoders) as well for similar reasons. In addition on the encoder side it's often necessary to "structure" the problem so you can perform things like early elimination of loops, and especially loads. Compilers simply can not generate autovectorized code that does those kinds of things.