Comment by brigade

9 years ago

> High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.

Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.

Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...

What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.

Indeed, I've worked with CPUs that didn't have the register split of x86 and they are far less friendly to implementing certain algorithms, which would otherwise require many additional registers and lots of shift/add instructions to achieve the same effect. ("MOV AH, AL" being one simple example -- try doing that on a MIPS, for instance.)

  • How often have you really needed to do "a = (a & 0xffff00ff) | ((a & 0xff) << 8)"? I don't think I've ever needed to do it, and I wouldn't be surprised if compilers don't even generate "mov ah,al" for that, due to the fact that AH/BH/CH/DH only exist for 4 registers.

    Anyway, since you asked: In AArch64 that would be written "bfi w0,w0,#8,#8". "bfi" is an alias of "bfm", an instruction far more flexible and useful than any of the baroque x86 register names. BFM can move an arbitrary number of bits from any position in any register to any other register, and it has optional zero-extension and sign-extension modes.

    • If you're talking to any memory-mapped registers, you'll be doing it all the time. Granted, you're much more likely to be using something like ARM to do that. x86 is a bit large/expensive/power hungry for embedded programming.