Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).
So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)
On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:
Thanks for everything Orson! I know Clang struggled to ship improved sorts for their C++ implementation, so it's a good sign that Rust was able to ship ipnsort and driftsort without too much chaos.
Also, Lukas looked over my `misfortunate` crate (which provides "perverse" implementations of safe Rust traits) and although misfortunate isn't intended for testing he has inspired me to improve the perverse implementations of Ord, not for testing per se but to further illustrate. It occurs to me I should point anybody reading misfortunate's documentation at your/ Lukas' work in case they actually need really nasty tests not just mild perversion.
I looked at your paper[0] and was curious why it was named "drift" sort. Even searching for 'drift' didn't show me. I mainly ask because this is noted as a stable sort and the word 'drift' implies movement; I did not expect it, from the name, to be a stable sort.
It's called driftsort because it's derived from another sort I made, glidesort: https://github.com/orlp/glidesort. Glidesort is a bit faster still for large inputs, however it was too large and complex for inclusion in the standard library, and suffered from code size penalties on small inputs. So driftsort is a slimmed down version more appropriate for general purpose.
for (int i = 0; i < 1000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch:
for (int i = 0; i < 1000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?
- the first one (branchless) use the condition to SAVE the correct value (< 500): it temporarily writes any current value to the same index i, always overwriting the previous value, effectively saving it (by moving forward to i+1) only when the value is right (small number).
Downside of this simple function: the last value may be bigger than 500
- the second one use the condition to ADD the value, when it is 100% sure it is a correct small number
They don't. After running, for the values in small_numbers from 0 to smlen-1 they are equivalent.
But if the last value of numbers[] is not smaller than 500, small_numbers[smlen] will contain that value for the first version whereas the second version does not write to small_numbers[smlen].
At what sequence point? The branchless version writes to small_numbers[smlen], for any given value of smlen, potentially more than once; so there are observable points of time during the loop where the behavior is different. But after the loop, both contain the final write to small_numbers[i] for all 0 <= I < smlen; and the transient writes both don't change observed external behavior, and are apparently cheaper than fewer but conditional writes.
Writing to array[n] and not incrementing n means that the value just written is outside the "useful" range (from 0 to n-1) and will not be considered (it will be overwritten the next iteration).
It doesn't convert bogosort into heapsort either, despite the second being much faster than the first. I'm guessing that it's not that easy going from one to the other because the only thing they have in common is the output (and only after you have checked the last value), so if the transformation is not hard-coded into the compiler, the odds of it randomly discovering the optimization is close to zero
The two code snippets do different things, apples and oranges... e.g. the array modification in the second example needs to move in front of the if for the two snippets to behave identically. I bet then the compiler output is the same with -O1 or higher.
PS: e.g. note how bla() (first code snippet) and blob() (fixed second code snippet) have identical output (both are turned into the same 'branchless' code via a conditional 'setl' instruction), but the blub() function (original second code snippet) differs because that function has different behaviour:
TL;DR: most 'branchless advice' that only tinkers with language features (like "x = a ? b : c" instead of an if) is useless because to the optimizer passes both are the same thing (a condition).
When there's a difference in the generated code then it's usually a bug and the before-after code are not actually equivalent (like in the code examples above).
It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have
T k; // default-construct
if (i > 0) k = left[--i]; // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know.
Again, in `partition_small`, we have
T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.
All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.
None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
...Ah, `heap_sort` is used only for trivially copyable types. So my complaint about not distinguishing copy from move is essentially unimportant (matters only in pathological cases that we shouldn't worry about).
But it's perfectly possible for a type to be "trivially copyable" without being "default-constructible." An example of such a type from the STL: `std::reference_wrapper<int>`.
Anyway, looks like a quick fix for this would be to just extend the list of traits on which blqsort is gated (currently `is_trivially_copyable` and `sizeof(T) <= 16`) by adding `is_trivially_default_constructible<T>::value` also.
A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.
On what datatype though, e.g. for sorting arbitrary length strings? I think that is if the comparator is expensive, quicksort and variants do not win because they do a constant factor more comparisons
I‘m always a bit envious when I see those branchless styles. In my day job I have the obligation to hit 100% modified condition/decision coverage, and I‘m daydreaming about having just one control flow through everything, in order to save module tests that only test the umpteenth condition combination.
Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.
> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
Branchful only wins via ILP when data becomes good predictable. But since Quicksort partitioning aims for a 50/50 split, it operates in the worst possible zone for a branch predictor. That's why branchless wins here, as proven by the benchmarks.
I was thinking that... When they say "modern CPUs", surely that includes any pipelined CPU? Maybe Pentium 4 era long pipelines in particular. But actual modern CPUs are much better at branch prediction.
But for something like sorting wouldn't the worst case be completely random data which would defeat any kind of branch prediction?
If it wasn't simple, there would be more lines of code to implement the same idea. As it is, he might have had to spend an hour understanding one line to understand that idea (1 line/hr slow), as opposed to spending an hour reading a hundred lines of code (100 line/hr fast) for the same result.
But merely the word simple isn't the best. Replace simple with elegant.
It can be quite hard to fully and correctly understand a small perfect thing.
If it gets 20 different jobs done without 20 different if statements, it's small and elegant, and simple to impliment, but not simple to understand (maybe simple to think you understand it.)
20 if statements you understand immediately and with no effort.
kind of the flip side of pascal's "I would have written a shorter letter, but I did not have the time." if someone does have the time to make the letter short, it'll take longer to read (where "read" means to grasp the subtleties of.)
>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:
>>for (int i = 0; i < 1000; i++) {
> small_numbers[smlen] = numbers[i];
> smlen += (numbers[i] < 500);
>}
Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0?
I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?
Nah. (numbers[i] < 500) is an expression which evaluates to true (1) or false (0). Evaluating this doesn't require a branch. There are instructions on modern CPUs to turn this expression into a number without a conditional jump. (cmp (compare), setle (set if comparison was less than or equal), then add).
> then wouldn't they also optimize the naive piece of code?
Great question. They do sometimes!
In general, the problem for compilers is that its not obvious which method would be better in some given piece of code. Most branches are highly predictable. Like, imagine a for loop which counts to 1000. At the end of the loop body, the code branches to see whether we should stay in the loop, or exit the loop. The first 999 times through the loop we keep going - so 99.9% of the time, the branch ends up taking the same path. Its very predictable! CPU designers optimise heavily for this, via branch prediction logic. Highly predictable branches run fast. (This is also why array bounds checking doesn't really hurt performance at all.)
But the branch predictor really struggles when the condition is unpredictable - ie, when a conditional branch is taken about 50% of the time. As is the case in a sorting algorithm.
The compiler has no idea whether any condition in your code is predictable or not. There are hints you can use, but it often defaults to just doing whatever you ask it to do.
Here's what the compiler actually does with the code you quoted. You can see the extra branch + jump for the second version of the code:
I clicked around - for some reason even using __builtin_expect_with_probability, none of the compilers I tried will convert from one version of this code into the other.
If you hoist small_numbers[smlen] = numbers[i] out of the if statement then the compiler will produce nearly the same branchless assembly for both cases (the only difference being compare against 499 followed by setle vs. compare against 500 followed by setl).
Taking a second look I want to say you need to hoist the assignment for the loops to be semantically identical anyways.
At the bottom of the page there's a link, "When ‘if’ slows you down, avoid it" [1], that discusses these exact questions. It's basically what @josephg said, but it also shows the assembly language for each version.
There's no branch in that code either way. The comparison operator outputs a value (which is arithmetic, not a branch), and that value is added unconditionally.
In modern CPUs a mispredicted branch is much more expensive than a memory write.
The unsaid assumption is that the array is filled with random values between 0 and 1000, so the "if" condition has a 50% of chances to be true. The branch is mispredicted 50% of times.
Of course this trick won't work when the statement protected by "if" is a more complex and costly action, or one that can't be undone (in the example, note that when the counter is not incremented, the value written to memory will be overwritten in the next cycle, so it's "undone" in a certain way).
> In modern CPUs a mispredicted branch is much more expensive than a memory write.
Mostly because of caching. The writes either go to the same address as a previous one or move only a small increment, so most writes are likely going to hit L1 cache. If it wrote to a random memory location after every iteration the cost of a misprediction would probably disappear in the noise.
Modern processors are pipelined, where they run a lot in advance of when the result is needed. A mispredicted branch requires throwing out all the advance calculations on the incorrectly followed path. The branch predictor can't predict branches like this based on data that tends to equally be distributed for taken and not taken. The memory write is fine because it's not conditional, so it can be pipelined along with everything else.
Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).
So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)
On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:
And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random:
Thanks for everything Orson! I know Clang struggled to ship improved sorts for their C++ implementation, so it's a good sign that Rust was able to ship ipnsort and driftsort without too much chaos.
Also, Lukas looked over my `misfortunate` crate (which provides "perverse" implementations of safe Rust traits) and although misfortunate isn't intended for testing he has inspired me to improve the perverse implementations of Ord, not for testing per se but to further illustrate. It occurs to me I should point anybody reading misfortunate's documentation at your/ Lukas' work in case they actually need really nasty tests not just mild perversion.
This is very impressive work.
I looked at your paper[0] and was curious why it was named "drift" sort. Even searching for 'drift' didn't show me. I mainly ask because this is noted as a stable sort and the word 'drift' implies movement; I did not expect it, from the name, to be a stable sort.
It's called driftsort because it's derived from another sort I made, glidesort: https://github.com/orlp/glidesort. Glidesort is a bit faster still for large inputs, however it was too large and complex for inclusion in the standard library, and suffered from code size penalties on small inputs. So driftsort is a slimmed down version more appropriate for general purpose.
is much faster than the conventional version with a conditional branch:
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?
Here is another perspective:
- the first one (branchless) use the condition to SAVE the correct value (< 500): it temporarily writes any current value to the same index i, always overwriting the previous value, effectively saving it (by moving forward to i+1) only when the value is right (small number). Downside of this simple function: the last value may be bigger than 500
- the second one use the condition to ADD the value, when it is 100% sure it is a correct small number
They don't. After running, for the values in small_numbers from 0 to smlen-1 they are equivalent.
But if the last value of numbers[] is not smaller than 500, small_numbers[smlen] will contain that value for the first version whereas the second version does not write to small_numbers[smlen].
> "these two loops compute the same value"
At what sequence point? The branchless version writes to small_numbers[smlen], for any given value of smlen, potentially more than once; so there are observable points of time during the loop where the behavior is different. But after the loop, both contain the final write to small_numbers[i] for all 0 <= I < smlen; and the transient writes both don't change observed external behavior, and are apparently cheaper than fewer but conditional writes.
First version has a side effect of writing to small_numbers[0] always.
The compiler probability can't optimize that in the second version.
If it wrote unconditionally and incremented only in the if then I'd guess they would compile to the same thing.
Writing to array[n] and not incrementing n means that the value just written is outside the "useful" range (from 0 to n-1) and will not be considered (it will be overwritten the next iteration).
I am rather thinking, if one is so much faster, and they are truly equal, why is the compiler too stupid to convert one into the other?
It doesn't convert bogosort into heapsort either, despite the second being much faster than the first. I'm guessing that it's not that easy going from one to the other because the only thing they have in common is the output (and only after you have checked the last value), so if the transformation is not hard-coded into the compiler, the odds of it randomly discovering the optimization is close to zero
1 reply →
The two code snippets do different things, apples and oranges... e.g. the array modification in the second example needs to move in front of the if for the two snippets to behave identically. I bet then the compiler output is the same with -O1 or higher.
PS: e.g. note how bla() (first code snippet) and blob() (fixed second code snippet) have identical output (both are turned into the same 'branchless' code via a conditional 'setl' instruction), but the blub() function (original second code snippet) differs because that function has different behaviour:
https://www.godbolt.org/z/h9Kfbn5bc
TL;DR: most 'branchless advice' that only tinkers with language features (like "x = a ? b : c" instead of an if) is useless because to the optimizer passes both are the same thing (a condition).
When there's a difference in the generated code then it's usually a bug and the before-after code are not actually equivalent (like in the code examples above).
1 reply →
It only increments if the number was less than 500, effectively just saving the ones less than 500.
numbers[i] < 500
is a conditional (true or false) that evaluates to 1 or 0 (in C)
Therefore smlen has either a 0 or a 1 added to it's value .. equivilent to only adding 1 if True.
It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know. Again, in `partition_small`, we have
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.
All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.
None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
...Ah, `heap_sort` is used only for trivially copyable types. So my complaint about not distinguishing copy from move is essentially unimportant (matters only in pathological cases that we shouldn't worry about).
But it's perfectly possible for a type to be "trivially copyable" without being "default-constructible." An example of such a type from the STL: `std::reference_wrapper<int>`.
Anyway, looks like a quick fix for this would be to just extend the list of traits on which blqsort is gated (currently `is_trivially_copyable` and `sizeof(T) <= 16`) by adding `is_trivially_default_constructible<T>::value` also.
Author here. No, it's also called from the non_trivially_copyable branch (as a fallback). I'll fix that.
why such love for copies tho?
why look for trivial copy and not trivial move?
Nitpicking the C variant:
> #define BLQS_CMP(a, b) ((a) < (b))
A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.
On what datatype though, e.g. for sorting arbitrary length strings? I think that is if the comparator is expensive, quicksort and variants do not win because they do a constant factor more comparisons
Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?
Why not compare against that?
Great question. It would also be fair to ask how this behaves with non-random inputs. The benchmarks in the repo only use random values.
Funny: you can cf "sorting network", and see they use them within their own design even.
I‘m always a bit envious when I see those branchless styles. In my day job I have the obligation to hit 100% modified condition/decision coverage, and I‘m daydreaming about having just one control flow through everything, in order to save module tests that only test the umpteenth condition combination.
Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.
> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
Branchful only wins via ILP when data becomes good predictable. But since Quicksort partitioning aims for a 50/50 split, it operates in the worst possible zone for a branch predictor. That's why branchless wins here, as proven by the benchmarks.
I was thinking that... When they say "modern CPUs", surely that includes any pipelined CPU? Maybe Pentium 4 era long pipelines in particular. But actual modern CPUs are much better at branch prediction.
But for something like sorting wouldn't the worst case be completely random data which would defeat any kind of branch prediction?
It is so simple that I had to look very slowly to understand. Nicely done.
If it wasn’t simple you could look fast and understand?
If it wasn't simple, there would be more lines of code to implement the same idea. As it is, he might have had to spend an hour understanding one line to understand that idea (1 line/hr slow), as opposed to spending an hour reading a hundred lines of code (100 line/hr fast) for the same result.
Yes.
But merely the word simple isn't the best. Replace simple with elegant.
It can be quite hard to fully and correctly understand a small perfect thing.
If it gets 20 different jobs done without 20 different if statements, it's small and elegant, and simple to impliment, but not simple to understand (maybe simple to think you understand it.)
20 if statements you understand immediately and with no effort.
kind of the flip side of pascal's "I would have written a shorter letter, but I did not have the time." if someone does have the time to make the letter short, it'll take longer to read (where "read" means to grasp the subtleties of.)
>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:
>>for (int i = 0; i < 1000; i++) {
> small_numbers[smlen] = numbers[i];
> smlen += (numbers[i] < 500);
>}
Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?
Nah. (numbers[i] < 500) is an expression which evaluates to true (1) or false (0). Evaluating this doesn't require a branch. There are instructions on modern CPUs to turn this expression into a number without a conditional jump. (cmp (compare), setle (set if comparison was less than or equal), then add).
> then wouldn't they also optimize the naive piece of code?
Great question. They do sometimes!
In general, the problem for compilers is that its not obvious which method would be better in some given piece of code. Most branches are highly predictable. Like, imagine a for loop which counts to 1000. At the end of the loop body, the code branches to see whether we should stay in the loop, or exit the loop. The first 999 times through the loop we keep going - so 99.9% of the time, the branch ends up taking the same path. Its very predictable! CPU designers optimise heavily for this, via branch prediction logic. Highly predictable branches run fast. (This is also why array bounds checking doesn't really hurt performance at all.)
But the branch predictor really struggles when the condition is unpredictable - ie, when a conditional branch is taken about 50% of the time. As is the case in a sorting algorithm.
The compiler has no idea whether any condition in your code is predictable or not. There are hints you can use, but it often defaults to just doing whatever you ask it to do.
Here's what the compiler actually does with the code you quoted. You can see the extra branch + jump for the second version of the code:
https://c.godbolt.org/z/zv7Tcd49f
I clicked around - for some reason even using __builtin_expect_with_probability, none of the compilers I tried will convert from one version of this code into the other.
If you hoist small_numbers[smlen] = numbers[i] out of the if statement then the compiler will produce nearly the same branchless assembly for both cases (the only difference being compare against 499 followed by setle vs. compare against 500 followed by setl).
Taking a second look I want to say you need to hoist the assignment for the loops to be semantically identical anyways.
At the bottom of the page there's a link, "When ‘if’ slows you down, avoid it" [1], that discusses these exact questions. It's basically what @josephg said, but it also shows the assembly language for each version.
[1] https://tiki.li/blog/branchless
There's no branch in that code either way. The comparison operator outputs a value (which is arithmetic, not a branch), and that value is added unconditionally.
Isn’t there an implicit check to exit the loop?
5 replies →
Anyone interested in branch-free code might like the book Hacker's Delight. Lots of examples of stuff like this in there.
Highly recommend. The exampels are in c/c++ and the same concepts can be ported to other languages like golang.
My favorite part of it is Chapter 2 the bit manipulation tricks.
If it's branch predicting, why would if statement run slow? How come unnecessary memory write is fine?
In modern CPUs a mispredicted branch is much more expensive than a memory write.
The unsaid assumption is that the array is filled with random values between 0 and 1000, so the "if" condition has a 50% of chances to be true. The branch is mispredicted 50% of times.
Of course this trick won't work when the statement protected by "if" is a more complex and costly action, or one that can't be undone (in the example, note that when the counter is not incremented, the value written to memory will be overwritten in the next cycle, so it's "undone" in a certain way).
> In modern CPUs a mispredicted branch is much more expensive than a memory write.
Mostly because of caching. The writes either go to the same address as a previous one or move only a small increment, so most writes are likely going to hit L1 cache. If it wrote to a random memory location after every iteration the cost of a misprediction would probably disappear in the noise.
Modern processors are pipelined, where they run a lot in advance of when the result is needed. A mispredicted branch requires throwing out all the advance calculations on the incorrectly followed path. The branch predictor can't predict branches like this based on data that tends to equally be distributed for taken and not taken. The memory write is fine because it's not conditional, so it can be pipelined along with everything else.
[dead]