Comment by LeFantome

2 months ago

Interestingly, RISC-V vector extensions are variable length.

So, you can compile your RISC-V software to require the equivalent of AVX and it will run on whatever size vectors the hardwre supports.

So, on x86-64, if I write AVX2 software and run it on AVX512 capable hardware, I am leaving performance on the table. But if I write software that uses AVX512, it will not run on hardware that does not support those extensions (flags).

On RISC-V, the same binary that uses 256 bit vectors on hardware that only supports that will use 512 bit vectors on hardware that supports it, or even 1024 bit vectors on hardware like the A100 cores of the SpacemiT K3.

So, I guess X86-64 is is the RyanAir of processors.

3 comments

LeFantome

janwas 2 months ago

(Personal opinion) I get the impression that RISC-V-related discussions often lack of awareness of prior work/alternatives. A large amount of (x86) software actually uses our Highway library to run on whatever size vectors and instructions the CPU offers.

This works quite well in practice. As to leaving performance on the table, it seems RVV has some egregious performance differences/cliffs. For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?

camel-cdr 2 months ago
> For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?
Use Zvzip, in the mean time:
zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented load/store when you are touching memory anyways
unzip: vsnrl
trn1/trn2: masked vslide1up/vslide1down with even/odd mask
The only thing base RVV does bad in those is register to register zip, which takes twice as many instructions as other ISAs. Zvzip gives you dedicated instructions of the above.
- janwas 2 months ago
  
  Looks like the ratification plan for Zvzip is November. So maybe 3y until HW is actually usable? That's a neat trick with wmacc, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(
  Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?
  [1]: https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3... [2]: https://github.com/google/highway/blob/master/g3doc/quick_re...