Comment by charleslmunger
6 hours ago
>Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified.
This is very nice of them to do, but I found while optimizing a routine in protobuf that BSR is dramatically slower on AMD CPUs than LZCNT, and so I never want to use it again - it's pretty rare to have a function using BSR that can't use CLZ instead, and CLZ is faster on arm, AMD, and equivalent on Intel since haswell.
I believe there is also some errata where on some processors Intel LZCNT had a false dependency for the output register as an input, probably because of this BSR behavior, but compilers will insert a self-xor in loops where that carried dependency would matter.
No comments yet
Contribute on Hacker News ↗