Comment by jjmarr
19 hours ago
From page 175 of the AMD CDNA4 ISA:
https://www.amd.com/content/dam/amd/en/documents/instinct-te...
> V_MUL_U32_U24
> Multiply two unsigned 24-bit integer inputs and store the result as an unsigned 32-bit integer into a vector register. D0.u32 = 32'U(S0.u24) * 32'U(S1.u24)
> Notes
> This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_U32_U24.
Nvidia GPUs used to do the same thing, and there's a __umul24 intrinsic if you care to use it.
https://stackoverflow.com/questions/5544355/cuda-umul24-func...
This is super-super-niche since it basically only applies to 32-bit integer multiplication.
You likely won't run into it unless you're doing high-performance embedded systems or GPU programming on non-NVIDIA cards and, for some unknowable reason, your workload does a 32-bit integer multiplication in the hot path.
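For anyone who wants the quoted semantics spelled out, here's a rough C++ sketch of what `D0.u32 = 32'U(S0.u24) * 32'U(S1.u24)` computes. The function names are made up for illustration, and this only models the arithmetic, not the hardware. A single-precision float has a 24-bit significand, which is presumably why the inputs are limited to 24 bits.

```cpp
#include <cstdint>

// Hypothetical helper names; they model the arithmetic only.
constexpr std::uint32_t kMask24 = 0x00FFFFFFu;

// Models D0.u32 = 32'U(S0.u24) * 32'U(S1.u24):
// keep the low 24 bits of each input, multiply, keep the low 32 bits.
inline std::uint32_t mul_u32_u24(std::uint32_t s0, std::uint32_t s1) {
    return (s0 & kMask24) * (s1 & kMask24);  // unsigned wraparound mod 2^32
}

// Full product of the two 24-bit inputs (up to 48 bits wide),
// shown only to make the truncation in mul_u32_u24 visible.
inline std::uint64_t mul_full_u24(std::uint32_t s0, std::uint32_t s1) {
    return static_cast<std::uint64_t>(s0 & kMask24) *
           static_cast<std::uint64_t>(s1 & kMask24);
}
```

Note that the upper 8 bits of each input are simply ignored, so the result only matches an ordinary 32-bit multiply when both operands already fit in 24 bits.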
That's literally only for 32bx24b (I don't remember why we did that specifically for CDNA - I'll ask someone), but as you can see from V_MUL_HI_I32 and V_MUL_LO_U32, there is very much vector arithmetic hardware (never mind that we're not talking about VALU but conventional scalar ALU).
I think he has a point, but I am still not 100% convinced by the arguments relating to casting.
There is a difference between a u24 data type inside u32 and a u24 data type inside u24, and that is what's so frustrating here. u24 is an alignment nightmare, so it will basically never exist as "u24 in u24" and only ever as "u24 in u32".
For casting to make sense, the alignment must be compatible, and it's not clear how you can make arbitrary-bit-width data types useful both for describing bit fields in packets, where padding is inherently undesirable, and for performing integer arithmetic with an FPU, where padding is an acceptable cost for alignment. These appear to be mutually exclusive use cases.
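To illustrate the "u24 in u24" versus "u24 in u32" distinction, here's a small C++ sketch. The type and function names are hypothetical, and little-endian byte order is an arbitrary choice for the example. The packed form is what you'd want in a packet layout, the widened form is what arithmetic hardware wants, and moving between them is an explicit conversion rather than a cast.

```cpp
#include <cstdint>

// "u24 in u24": three bytes, no padding, byte alignment (packet-style layout).
struct PackedU24 {
    std::uint8_t bytes[3];  // little-endian order, chosen for this sketch
};
static_assert(sizeof(PackedU24) == 3, "expected no padding");
static_assert(alignof(PackedU24) == 1, "expected byte alignment");

// "u24 in u32": widen into a naturally aligned 32-bit value for arithmetic.
inline std::uint32_t widen(const PackedU24& p) {
    return static_cast<std::uint32_t>(p.bytes[0]) |
           (static_cast<std::uint32_t>(p.bytes[1]) << 8) |
           (static_cast<std::uint32_t>(p.bytes[2]) << 16);
}

// Going back drops the top 8 bits of the 32-bit container.
inline PackedU24 narrow(std::uint32_t v) {
    PackedU24 p;
    p.bytes[0] = static_cast<std::uint8_t>(v & 0xFFu);
    p.bytes[1] = static_cast<std::uint8_t>((v >> 8) & 0xFFu);
    p.bytes[2] = static_cast<std::uint8_t>((v >> 16) & 0xFFu);
    return p;
}
```

An array of `PackedU24` puts most elements at addresses that aren't 4-byte aligned, which is why reinterpreting packed storage as `std::uint32_t` isn't an option and the byte-wise conversion is needed.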