
Comment by grumbelbart2

6 hours ago

> (d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

To reiterate, this is our observation as well. The first AVX-512 processors would execute such code quite fast for a short time, then overheat and throttle, leading to worse wall-time performance than the corresponding 256-bit AVX code.

I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.
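For what it's worth, the "measure on the target system" approach can be baked into the program itself. Below is a minimal sketch in C (the two `sum_*` kernels, the single-shot timing, and the input size are placeholder assumptions of mine, not anything from this thread): time each candidate implementation once on a representative input at startup and dispatch through a function pointer to whichever was faster.

```c
/* Sketch of an empirical dispatcher: benchmark the candidates on this
 * machine once at startup and keep the fastest. */
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Two interchangeable implementations; stand-ins for e.g. a scalar
 * build and a vectorized build of the same kernel. */
static int64_t sum_v1(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

static int64_t sum_v2(const int32_t *a, size_t n) {
    int64_t s0 = 0, s1 = 0;
    for (size_t i = 0; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
    if (n & 1) s0 += a[n - 1];
    return s0 + s1;
}

typedef int64_t (*sum_fn)(const int32_t *, size_t);

static double time_one(sum_fn f, const int32_t *a, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile int64_t sink = f(a, n);   /* keep the call from being optimized away */
    (void)sink;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

static sum_fn pick_fastest(const int32_t *a, size_t n) {
    /* A real dispatcher would warm up and take several samples. */
    return time_one(sum_v1, a, n) <= time_one(sum_v2, a, n) ? sum_v1 : sum_v2;
}

int main(void) {
    static int32_t data[1 << 20];      /* 4 MiB of representative input */
    sum_fn best = pick_fastest(data, 1 << 20);
    printf("picked %s\n", best == sum_v1 ? "sum_v1" : "sum_v2");
    return 0;
}
```

In practice you would repeat the measurement, warm the caches, and maybe re-check periodically, but the point stands: the decision is made on the machine that actually runs the code, which is exactly where the "its own challenges" part comes in.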

The processors with severe throttling from AVX-512 were server CPUs, i.e. Skylake Server and its derivatives, like Cascade Lake and Cooper Lake.

Only a few of those CPUs have been used in workstations, i.e. high-end desktop computers.

The vast majority of the CPUs with AVX-512 that can be encountered among the general population are either AMD Zen 4 and Zen 5 CPUs or some older Intel CPUs from the Tiger Lake, Ice Lake and Rocket Lake families. None of these have AVX-512 throttling problems.

The owners of server computers are more likely to be knowledgeable about them and to choose programs compiled for an appropriate target CPU model.

Therefore I believe that nowadays, when the percentage of computers with good AVX-512 support is increasing, and even Intel is expected to introduce Nova Lake with AVX-512 support by the end of the year, an application should be compiled so that whenever it detects AVX-512 support at runtime, it dispatches to the corresponding code path.
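As a minimal sketch of that kind of dispatch in C, assuming GCC or Clang on x86-64 (the kernel names are placeholders of mine): check the CPUID feature bit once via `__builtin_cpu_supports` and route calls through a function pointer.

```c
/* Runtime dispatch sketch: use the AVX-512 variant only when the CPU
 * reports support for it, otherwise fall back to a generic build. */
#include <stddef.h>
#include <stdio.h>

__attribute__((target("avx512f")))
static void process_avx512(float *x, size_t n) {
    for (size_t i = 0; i < n; i++)     /* compiler may vectorize this with 512-bit registers */
        x[i] *= 2.0f;
}

static void process_generic(float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] *= 2.0f;
}

typedef void (*process_fn)(float *, size_t);

static process_fn select_impl(void) {
    __builtin_cpu_init();              /* initialize the CPU feature probe */
    if (__builtin_cpu_supports("avx512f"))
        return process_avx512;
    return process_generic;
}

int main(void) {
    static float buf[1024];
    process_fn process = select_impl();
    process(buf, 1024);
    printf("dispatched to the %s path\n",
           process == process_avx512 ? "AVX-512" : "generic");
    return 0;
}
```

GCC and Clang can also generate this kind of dispatcher automatically via function multi-versioning (`target_clones`); the manual version above just makes the mechanism explicit.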

On the computers with AVX-512 support, using it can provide a significant increase in performance, while the computers where this could be harmful are increasingly unlikely to be encountered outside datacenters that have failed to update their servers.

Skylake Server was introduced 9 years ago and Ice Lake Server, which corrected the behavior, was introduced 6 years ago. Therefore, wherever performance matters, the Skylake Server derivatives would have been replaced by now, as a single Epyc server can replace a cluster of servers with Skylake Server CPUs at much lower power consumption and with higher performance.

> I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.

Yeah, and it's incredibly frustrating, because there is almost zero theory on how to write performant code. Will caching things in memory be faster than re-requesting them over the network? Who knows! Sometimes it won't be! But you can't predict those times beforehand, which turns this whole field into pure black magic instead of anything remotely resembling engineering or science, since theoretical knowledge has no relation to reality.

At my last job we had one of the weirdest "memory access is slo-o-o-ow" scenarios I've ever seen (and it would reproduce pretty reliably... after about 20 hours of the service's continuous execution): somehow, due to peculiarities of the GC and the Linux physical memory manager, almost all of the working set of our application would end up on a single physical DDR stick, as opposed to being evenly spread across the four sticks the server actually had. Since a single memory stick literally can't cope with such high data throughput, the performance tanked.

And it took us quite some time to figure out what the hell was going on, because nothing came up on the perf graphs or metrics or whatever: it's just that almost everything in the application's userspace became slower. No, the CPU is definitely not throttled, it's actually 20–30% idle. No, there is almost zero disk activity, and the network is fine. Huh?!
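A quick way to confirm that kind of "the CPU is idle but everything is slow" symptom is to measure sustained memory bandwidth directly on the affected host. Below is a crude STREAM-style copy benchmark in C (the buffer size and iteration count are assumptions of mine, not details from the story); a collapse from healthy multi-channel numbers to roughly what a single DIMM can deliver points at exactly this kind of placement problem.

```c
/* Crude memory-bandwidth check: repeatedly copy a buffer much larger
 * than the last-level cache and report the sustained copy rate. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (256 * 1024 * 1024)   /* 256 MiB per buffer, well above typical LLC sizes */

int main(void) {
    char *src = malloc(N), *dst = malloc(N);
    if (!src || !dst) return 1;
    memset(src, 1, N);          /* touch the pages so they are actually backed by RAM */
    memset(dst, 0, N);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 10; i++) {
        src[i] = (char)i;       /* tiny change per pass so the copies are not optimized out */
        memcpy(dst, src, N);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile char sink = dst[5];
    (void)sink;

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gib  = 10.0 * 2.0 * N / (1024.0 * 1024.0 * 1024.0);  /* each pass reads N and writes N bytes */
    printf("~%.1f GiB/s sustained copy bandwidth\n", gib / secs);

    free(src);
    free(dst);
    return 0;
}
```

Run it (built with something like `cc -O2`) on a healthy machine and on the misbehaving one; the scenario above would show up as the bad host reporting a fraction of the bandwidth while still looking idle in the usual CPU and disk metrics.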