Comment by pjmlp
3 days ago
Additionally there is still too much performance left on the table by not properly using CPU vector units.
SIMD performance in modern Intel and AMD CPUs is so bad that it is useless outside very specific circumstances.
This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.
The shared resources often involve floating-point registers and compute, so it's a double whammy.
Yet it is still faster than doing nothing, or than calling into the GPU, on workloads where bus traffic takes up the majority of execution time.
The comparison is often just plain old linear code.
For example, one SIMD instruction vs. multiple scalar arithmetic instructions.
We have fifty years of CPU design optimizing for this. More often than not, you'll find this works better than vector instructions in practice.
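To make that comparison concrete, here is a rough sketch of the two styles for adding four floats. It assumes GCC or Clang, since it relies on their `vector_size` extension rather than on any particular instruction set, and the names `v4f`, `add_scalar`, `add_vector` and `demo` are made up for illustration. It only shows that both forms compute the same thing; it says nothing about which is faster on a given CPU, and a compiler may well auto-vectorize the scalar loop anyway.

```c
#include <assert.h>
#include <string.h>

/* Assumption: GCC/Clang vector extension; 4 floats packed in 16 bytes. */
typedef float v4f __attribute__((vector_size(16)));

/* Plain linear code: four separate scalar additions. */
void add_scalar(const float *a, const float *b, float *out) {
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Vector form: one add over all four lanes at once.
   memcpy avoids alignment assumptions about the input pointers. */
void add_vector(const float *a, const float *b, float *out) {
    v4f va, vb, vc;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vc = va + vb; /* element-wise add across the four lanes */
    memcpy(out, &vc, sizeof vc);
}

/* Checks that both versions agree on a small input. */
void demo(void) {
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float s[4], v[4];
    add_scalar(a, b, s);
    add_vector(a, b, v);
    for (int i = 0; i < 4; i++) {
        assert(s[i] == v[i]);
    }
}
```

Whether the vector version actually wins depends on the surrounding code, which is the point being argued here: you would have to measure both on your own workload.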
The concept behind vector instructions is great, and it starts to work out for larger widths like 512 bits. But it's extremely tricky to take advantage of that much SIMD with a compiler or manually.