Comment by charleslmunger

3 hours ago

>Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified.

This is very nice of them, but while optimizing a routine in protobuf I found that BSR is dramatically slower than LZCNT on AMD CPUs, so I never want to use it again. It's pretty rare for a function that uses BSR to be unable to use a count-leading-zeros (CLZ) operation instead, and CLZ is faster on Arm and AMD, and equivalent on Intel since Haswell.
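To make the BSR-to-CLZ substitution concrete, here is a minimal sketch, assuming GCC or Clang (`__builtin_clz` is their builtin, not standard C). The helper names `highest_set_bit` and `clz32` are made up for illustration and are not from protobuf. For a nonzero 32-bit input, the bit index BSR returns is just 31 minus the leading-zero count.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, i.e. the value BSR produces for a
 * nonzero input, computed from a count-leading-zeros builtin.
 * __builtin_clz(0) is undefined, so zero is rejected up front. */
static inline unsigned highest_set_bit(uint32_t x) {
    assert(x != 0);
    return 31u - (unsigned)__builtin_clz(x);
}

/* Leading-zero count that is also defined for zero (returns 32),
 * matching what the LZCNT instruction reports for a zero source. */
static inline unsigned clz32(uint32_t x) {
    return x ? (unsigned)__builtin_clz(x) : 32u;
}
```

Which instruction this becomes depends on the target flags: built with something like `-mlzcnt` or `-march=haswell`, GCC and Clang can use LZCNT directly, while a baseline x86-64 build typically falls back to BSR plus an XOR with 31.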

I believe there is also an erratum where, on some Intel processors, LZCNT has a false dependency on its output register as an input, probably because of this BSR behavior. Compilers will insert a self-xor of the destination register in loops where that carried dependency would matter.
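As a rough illustration of where that carried dependency could show up, here is a sketch of a loop that counts leading zeros on every iteration, again assuming GCC or Clang builtins. The function name `total_leading_zeros` is hypothetical, and the assembly in the comment is only the general shape a compiler might emit when tuning for an affected core, not actual output from any particular build.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums the leading-zero counts of n 64-bit words.
 *
 * On a core with the false output dependency, each LZCNT would wait
 * for the previous write to its destination register. A compiler
 * tuning for such a core may break the chain along these lines:
 *
 *     xor   eax, eax        ; self-xor: destination no longer depends
 *     lzcnt rax, [rdi+rcx*8]
 *     add   rdx, rax
 */
uint64_t total_leading_zeros(const uint64_t *v, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        /* __builtin_clzll(0) is undefined, so handle zero explicitly. */
        total += v[i] ? (uint64_t)__builtin_clzll(v[i]) : 64u;
    }
    return total;
}
```

The self-xor is a zeroing idiom the CPU recognizes as having no input dependency, which is why it is enough to cut the chain.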