Comment by physicsguy

1 month ago

A few gentle points:

(a) You mention that the NVidia docs push people to use libraries, etc. to really get high performance CUDA kernels, rather than writing them themselves. My argument would be that SIMD is exactly the same - they're something really that are perfect if you're writing a BLAS implementation but are too low level for most developers thinking about a problem to make use of.

(b) You show a problem where autovectorisation fails because of branching, and jump straight to intrinsics as the solution which you basically say are ugly. Looking at the intrinsic code, you're using a mask to deal with the branching. But there's a middle ground - almost always you would want to try restructuring the problem, e.g. splitting up loops and adding masks where there's conditions - i.e. lean into the SIMD paradigm. This would also be the same advice in CUDA.

(c) As you've found, GCC actually performs quite poorly for x86-64 optimisations compared to Intel. It's not always clear cut though, the Intel Compiler for e.g. sacrifices IEEE 764 float precision and go down to ~14 digits of precision in it's defaults, because it sets the flag `-fp-model=fast -fma`. This is true of both the legacy and new Intel compiler. If you switch to `-fp-model=strict` then you may find that the results are closer.

(d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

12 comments

physicsguy

mort96 1 month ago

Regarding (b), I would never rely on auto vectorization because I have no insight into it. The only way to check if my code is written such that auto-vectorization can do its thing is to compile my program with every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect. That's a terrible developer experience; writing intrinsics by hand is much easier, and more robust. I'd need to re-check every piece of autovectorized code after every change and after every compiler upgrade.

I just treat autovectorization like I treat every other fancy optimization that's not just constant folding and inlining: nice when it happens to work out, it probably happens to make my code a couple percent faster on average, but I absolutely can't rely on it in any place where I depend on the performance.

physicsguy 1 month ago
> every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect
I just used to fire up VTune and inspect the hot loops... typically if you care about this you're only really working on hardware targeting the latest instruction sets anyway in my experience. It's only if you're working on low level libraries I would bother doing intrinsics all over the place.
For most consumer software you want to be able to fall back to some lowest-common-denominator hardware anyway otherwise people using it run into issues - same reason that Debian, Conda, etc. only go up to really old instructions sets.
- mort96 1 month ago
  
  I work on games sometimes, where the goal is: "run as fast as possible on everyone's computer, whether it's a 15 year old netbook with an Intel Atom or a freshly built beast of a gaming desktop". As a result, the best approach is to discover supported instructions at runtime and dispatch to a function that's using those instructions (maybe populating a global vector function table at launch?). Next best is to assume some base level vector support (maybe the original AVX for x86, Neon for ARM) and unconditionally use those. Targeting only the latest instruction sets is a complete non-starter.

grumbelbart2 1 month ago

> (d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

To re-iterate, this is our observation as well. The first AVX512 processors would execute such code quite fast for a short time, then overheat and throttle, leading to a worse wall-time performance than the corresponding AVX256 code.

I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.

adrian_b 1 month ago

The processors with severe throttling from AVX-512 were server CPUs, i.e. Skylake Server and its derivatives, like Cascade Lake and Cooper Lake.
Only few of those CPUs have been used in workstations, i.e. high-end desktop computers.
The vast majority of the CPUs with AVX-512 that can be encountered at the general population are either AMD Zen 4 and Zen 5 CPUs or some old Intel CPUs from the Tiger Lake, Ice Lake and Rocket Lake families. All these do not have AVX-512 throttling problems.
The owners of server computers are more likely to be knowledgeable about them and choose programs compiled with an appropriate target CPU model.
Therefore I believe that nowadays, when the percentage of computers with good AVX-512 support is increasing, and even Intel is expected to introduce by the end of the year Nova Lake with AVX-512 support, an application should be compiled such that whenever it detects AVX-512 support it should dispatch to the corresponding branch.
On the computers with AVX-512 support, using it can provide a significant increase in performance, while the computers where this could be harmful are more and more unlikely to be encountered outside datacenters that have failed to update their servers.
Skylake Server was introduced 9 years ago and Ice Lake Server, which corrected the behavior, was introduced 6 years ago. Therefore, wherever performance matters, the Skylake Server derivatives would have been replaced by now, as a single Epyc server can replace a cluster of servers with Skylake Server CPUs, at a much lower power consumption and with a higher performance.
jandrewrogers 1 month ago

There is a large variance in AVX-512 performance across the microarchitectures, particularly the early ones. In addition to throttling, which was mitigated relatively early, some relatively feature-complete microarchitectures (e.g. Ice Lake) were sensitive to alignment. The microarchitectures with these issues are approaching obsolescence at this point. AVX-512 runs very well with predictable performance on most vaguely recent microarchitectures.
In my case I find AVX-512 to be usable in practice because the kinds of users that are obsessed with performance-engineered systems are typically not running creaky hardware.
Joker_vD 1 month ago
> I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.
Yeah, and it's incredibly frustrating because there is almost zero theory on how to write performant code. Will caching things in memory be faster than re-requesting them over network? Who knows! Sometimes it won't! But you can't predict what those times will be beforehand which turns this whole field into pure black magic instead of anything remotely similar to engineering or science, since theoretical knowledge has no relation to reality.
At my last job we had one of the weirdest "memory access is slo-o-o-ow" scenarios I've ever seen (and it would reproduce pretty reliably... after about 20 hours of the service's continuous execution): somehow, due to peculiarities of the GC and Linux physical memory manager, almost all of the working set of our application would end up in a single physical DDR stick, as opposed to being evenly spread across 4 stick the server actually has. Since a single memory stick literally can't cope with such high data throughput, the performance tanked. And it took quite some time us to figure out what the hell was going on because nothing came up on the perf graphs or metrics or whatever: it's just that almost everything in the the application's userspace became slower. No, the CPU is definitely not throttled, it's actually 20–30% idle. No, there is almost zero disk activity, and the network is fine. Huh?!
- physicsguy 1 month ago
  
  You do have NUMA to control memory placement, it's not that easy to use though: https://blog.rwth-aachen.de/itc-events/files/2021/02/13-open...
  
  1 reply →

shihab 1 month ago

Hi, thanks for reading.

Re (b) I'm curious what that middle ground is. Is there any simple refactor to help GCC to get rid of this `if`? (Note, ISPC did fine here)

Regarding (a), one of the points I wanted to get across was that it didn't feel that complicated to program in the end as I had thought. Porting to AVX-512 felt mechanical (hence the success of LLMs in one-shotting the whole thing).

This is a subjective opinion, depends on programmer's experience etc- so I won't dwell on it. I just wish more CPU programmers gave it a try.

Remnant44 1 month ago

Fort what it's worth, I had the exact same experience you did when I started writing SIMD code explicitly with intrinsics.
I avoided it for a long time because, well, it was so damn ugly and verbose to do simple things. However, in actual practice it's not nearly as painful as it looks, and you get used to it quickly.
physicsguy 1 month ago
The typical way would be to unroll the inner loop manually; often you can get away with:
for (int i = 0; i < N; i += SIMD_WIDTH) { for (int j = 0; j < SIMD_WIDTH) { // do code } }
but failing the compiler optimising that you can do it more like:
for(int i = 0; i < N; i+= SIMD_WIDTH) { float mask[8]; // do work into mask, find max of the mask }
That's effectively what you're doing anyway in the SIMD code, but it keeps it more readable for mere mortals, and because you can define SIMD_WIDTH as a constant, it's also slightly easier to change if a new instruction set comes along; you're not maintaining multiple kernels.