Comment by bagxrvxpepzn
14 hours ago
> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
Branchful only wins via ILP when data becomes good predictable. But since Quicksort partitioning aims for a 50/50 split, it operates in the worst possible zone for a branch predictor. That's why branchless wins here, as proven by the benchmarks.
I do not agree to call "exceptions" the cases where branchless code is preferable, because they can be quite frequent in certain application domains, like sorting and searching.
The difference between the cases when branches are worse and the cases when they are better, is whether the tested condition is random (i.e. unpredictable) or not.
Whenever you compare a random number with a threshold (or two random numbers between themselves) and use the result for conditional execution, that is an example where using branches is worse.
In most cases, when writing a program it is easy to estimate whether branches will be predictable or not, and in the latter case branchless methods should be used.
I was thinking that... When they say "modern CPUs", surely that includes any pipelined CPU? Maybe Pentium 4 era long pipelines in particular. But actual modern CPUs are much better at branch prediction.
But for something like sorting wouldn't the worst case be completely random data which would defeat any kind of branch prediction?
Branch prediction is indeed necessary for any pipelined CPU. It does not matter whether the CPU also has other more modern features, like being superscalar, having out-of-order execution, etc.
That is why the simplest kind of branch prediction, i.e. static branch prediction, had already been implemented in an early pipelined computer, the IBM Stretch, which was designed around 1959/1960, so it is hardly a "modern" computer.
When the result of a comparison is random, which happens frequently in certain kinds of sorting or searching applications, as you say, that defeats any kind of branch prediction.