The scalar efficiency SIG has already been discussing bitfield insert and extract instructions.
We figured out yesterday [1], that the example in the article can already be done in four risc-v instructions, it's just a bit trickier to come up with it:
Nice trick, in fact with 4 instructions it's as efficient as extract/insert and it works for all ADD/SUB/OR/XOR/CMP instructions (not for AND), except if the source is a high-byte register. However it's not really a problem if code generation is not great in this case: compilers in practice will not generate accesses to these registers, and while old 16-bit assembly code has lots of such accesses it's designed to run on processors that ran at 4-20 MHz.
Flag computation and conditional jumps is where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
That's just one more instruction, to right-align the AH, BH etc src operand prior to exactly the same instructions as above.
And, yes, this being 64 bit code compilers won't be generating such instructions. In fact they started avoiding them as soon as OoO hit in the Pentium Pro, P II, P III etc in the mid 90s because of "partial register update stalls".
In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).
Whoa, someone else who doesn't believe that the RISC-V ISA is 'perfect'!
I'm curious: how the discussions on the bitfield extract have been going? Because it does really seem like an obvious oversight and something to add as a 'standard extension'.
What's your take on
1) unaligned 32bit instructions with the C extension?
2) lack of 'trap on overflow' for arithmetic instructions?
MIPS had it..
IMHO they made a mistake by not allowing immediate data to follow instructions. You could encode 8 bit constants within the opcode, but anything larger should be properly supported with immediate data. As for the C extension, I think that was also inferior because it was added afterward. I'd like to see a re-encoding of the entire ISA in about 10 years once things are really stable.
The handling of misaligned loads/stores in RISC-V is also can be considered a disappointing point: https://github.com/riscv/riscv-isa-manual/issues/1611 It oozes with preferring convenience of hardware developers and "flexibility" over making practical guarantees needed by software developers. It looks like the MIPS patent on misaligned load/store instructions has played its negative role. The patent expired in 2019, but it seems we are stuck with the current status quo nevertheless.
1. aarch64 does this right. RISCV tries to be too many things at once, and predictably ends up sucking at everything. Fast big cores should just stick to fixed size instrs for faster decode. You always know where instrs start, and every cacheline has an integer number of instrs. microcontroler cores can use compressed intrs, since it matters there, while trying to parallel-codec instrs does not matter there. Trying to have one arch cover it all is idiotic.
2. nobody uses it on mips either, so it is likely of no use.
Bitfield-extract is being discussed for a future extension. E.g. Qualcomm is pressing for it to be added.
In the meantime, it can be done as two shifts: left to the MSB, and then right filling with zero or sign bits.
There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, maybe some that already do.
However, I've also seen that one core (XianShan Nanhu) is fusing pairs of RVI instructions into one in the B extension, to be able to run old binaries compiled for CPUs without B faster.
Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.
The scalar efficiency SIG has already been discussing bitfield insert and extract instructions.
We figured out yesterday [1], that the example in the article can already be done in four risc-v instructions, it's just a bit trickier to come up with it:
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...
Nice trick, in fact with 4 instructions it's as efficient as extract/insert and it works for all ADD/SUB/OR/XOR/CMP instructions (not for AND), except if the source is a high-byte register. However it's not really a problem if code generation is not great in this case: compilers in practice will not generate accesses to these registers, and while old 16-bit assembly code has lots of such accesses it's designed to run on processors that ran at 4-20 MHz.
Flag computation and conditional jumps is where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
> except if the source is a high-byte register
That's just one more instruction, to right-align the AH, BH etc src operand prior to exactly the same instructions as above.
And, yes, this being 64 bit code compilers won't be generating such instructions. In fact they started avoiding them as soon as OoO hit in the Pentium Pro, P II, P III etc in the mid 90s because of "partial register update stalls".
Actually, Box64 can also store operands for later computation, depending on what comes next...
Author here, we have adopted this approach as a fast path to box64: https://github.com/ptitSeb/box64/pull/1763, thank you very much!
None of this is new. None of it.
In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).
Some of the better RISCV designs, in fact, implement a custom instr to do this, eg: BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....
Whoa, someone else who doesn't believe that the RISC-V ISA is 'perfect'! I'm curious: how the discussions on the bitfield extract have been going? Because it does really seem like an obvious oversight and something to add as a 'standard extension'.
What's your take on
1) unaligned 32bit instructions with the C extension?
2) lack of 'trap on overflow' for arithmetic instructions? MIPS had it..
IMHO they made a mistake by not allowing immediate data to follow instructions. You could encode 8 bit constants within the opcode, but anything larger should be properly supported with immediate data. As for the C extension, I think that was also inferior because it was added afterward. I'd like to see a re-encoding of the entire ISA in about 10 years once things are really stable.
1 reply →
The handling of misaligned loads/stores in RISC-V is also can be considered a disappointing point: https://github.com/riscv/riscv-isa-manual/issues/1611 It oozes with preferring convenience of hardware developers and "flexibility" over making practical guarantees needed by software developers. It looks like the MIPS patent on misaligned load/store instructions has played its negative role. The patent expired in 2019, but it seems we are stuck with the current status quo nevertheless.
1. aarch64 does this right. RISCV tries to be too many things at once, and predictably ends up sucking at everything. Fast big cores should just stick to fixed size instrs for faster decode. You always know where instrs start, and every cacheline has an integer number of instrs. microcontroler cores can use compressed intrs, since it matters there, while trying to parallel-codec instrs does not matter there. Trying to have one arch cover it all is idiotic.
2. nobody uses it on mips either, so it is likely of no use.
16 replies →
Bitfield-extract is being discussed for a future extension. E.g. Qualcomm is pressing for it to be added.
In the meantime, it can be done as two shifts: left to the MSB, and then right filling with zero or sign bits. There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, maybe some that already do.
However, I've also seen that one core (XianShan Nanhu) is fusing pairs of RVI instructions into one in the B extension, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.