
Comment by packetlost

1 year ago

That seems like something the compiler would generally handle, no? Obviously that doesn't apply everywhere, but in the general case it should.

Vector stuff is typically hand-coded with intrinsics or assembly. Autovectorization has mixed results because there's no way to ask the compiler to promise that it vectorized the code.

But for an emulator like this, box64 has to pick how to emulate vectorized instructions on RISC-V (e.g. slowly using scalars, or trying to reimplement them with native vector instructions). The challenge, of course, is that you typically don't get as good performance unless the emulator can actually rewrite the code on the fly: a 1:1 instruction mapping is suboptimal compared to noticing patterns of high-level operations and replacing a whole chunk of instructions at once with a more optimized implementation that accounts for implementation differences on the chip (e.g. you may have to emulate missing instructions one by one, but a rewriter could skip emulation entirely if there's an alternate way to accomplish the same high-level computation). Both strategies are sketched below.
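To make that concrete, here's a rough sketch of my own (not box64's actual code; the xmm_t layout and function names are hypothetical) of the two strategies for a single guest instruction, x86 SSE2's PADDD (four packed 32-bit adds):

    /* Hypothetical sketch: emulating x86 SSE2 PADDD on RISC-V. */
    #include <stdint.h>

    typedef struct { uint32_t lane[4]; } xmm_t;  /* emulated 128-bit register */

    /* Strategy 1: 1:1 scalar emulation -- four adds per guest instruction. */
    void emu_paddd_scalar(xmm_t *dst, const xmm_t *src) {
        for (int i = 0; i < 4; i++)
            dst->lane[i] += src->lane[i];
    }

    #if defined(__riscv_v_intrinsic)  /* RVV 1.0 intrinsics available */
    #include <riscv_vector.h>
    /* Strategy 2: map onto a native vector add -- all four lanes at once. */
    void emu_paddd_rvv(xmm_t *dst, const xmm_t *src) {
        size_t vl = __riscv_vsetvl_e32m1(4);
        vuint32m1_t a = __riscv_vle32_v_u32m1(dst->lane, vl);
        vuint32m1_t b = __riscv_vle32_v_u32m1(src->lane, vl);
        __riscv_vse32_v_u32m1(dst->lane, __riscv_vadd_vv_u32m1(a, b, vl), vl);
    }
    #endif

A pattern-matching rewriter would go further still, e.g. replacing a whole loop of PADDDs with one strided RVV loop instead of translating instruction by instruction.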

The biggest challenge for something like this from a performance perspective, of course, will be translating the GPU stuff efficiently to hit the native driver code, given that RISC-V systems likely rely on OSS GPU drivers (and maybe Wine adding another translation layer if the game is Windows-only).

  • I'd assume it uses RADV, same as the Steam Deck. For most workloads that's faster than AMD's own driver. And yes, it uses Wine and DXVK. As far as the game is concerned, it's running on a DirectX-capable x86 Windows machine. That's a lot of translation layers.

  • On clang, you can actually request that it gives a warning on missed vectorization of a given loop with "#pragma clang loop vectorize(enable)": https://godbolt.org/z/sP7drPqMT (and you can even make it an error).

    There's even "#pragma clang loop vectorize(assume_safety)" to tell it that pointer aliasing won't be an issue (gcc has a similar "#pragma GCC ivdep"), which should get rid of most odd reasons for missed vectorization.
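    A minimal sketch of both pragmas (the function bodies are mine; the pragmas are standard clang ones, clang warns under -Wpass-failed when a pragma-requested transformation fails, and -Werror=pass-failed turns that into an error):

      /* Compile with e.g.: clang -O2 -Werror=pass-failed */
      void scale(float *dst, const float *src, int n) {
      #pragma clang loop vectorize(enable)  /* diagnose if this loop misses */
          for (int i = 0; i < n; i++)
              dst[i] = src[i] * 2.0f;
      }

      /* assume_safety: promise the vectorizer that dst and src don't
         alias, removing the most common reason for a miss. */
      void scale_noalias(float *dst, const float *src, int n) {
      #pragma clang loop vectorize(assume_safety)
          for (int i = 0; i < n; i++)
              dst[i] = src[i] * 2.0f;
      }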

  • > Vector stuff is typically hand-coded with intrinsics or assembly. Autovectorization has mixed results because there's no way to ask the compiler to promise that it vectorized the code.

    Right, but most of the time those are architecture-specific, and RVV 1.0 is substantially different from, say, NEON or SSE2, so you need to change them anyway (see the sketch below). You also typically use specialized registers for those, not the general-purpose registers. I'm not saying there isn't work to be done (especially for an application like this one, which is extremely performance-sensitive), I'm saying that most applications won't have these problems or be so sensitive that register spills matter much, if at all.
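    For illustration (my own sketch, not from any particular codebase), here's the same four-float add written against each ISA's intrinsics. Note that RVV is length-agnostic (the vector length vl is requested at runtime), unlike the fixed 128-bit NEON/SSE forms, which is part of why a port isn't mechanical:

      #include <stddef.h>

      #if defined(__SSE__)
      #include <xmmintrin.h>
      void add4(float *d, const float *a, const float *b) {
          _mm_storeu_ps(d, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
      }
      #elif defined(__ARM_NEON)
      #include <arm_neon.h>
      void add4(float *d, const float *a, const float *b) {
          vst1q_f32(d, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
      }
      #elif defined(__riscv_v_intrinsic)
      #include <riscv_vector.h>
      void add4(float *d, const float *a, const float *b) {
          size_t vl = __riscv_vsetvl_e32m1(4);  /* ask for 4 lanes */
          __riscv_vse32_v_f32m1(d,
              __riscv_vfadd_vv_f32m1(__riscv_vle32_v_f32m1(a, vl),
                                     __riscv_vle32_v_f32m1(b, vl), vl), vl);
      }
      #endif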

    • I'm highlighting that the compiler doesn't take care of vector code as automatically, or as well, as it does register allocation and instruction selection, which are somewhat more solved problems. And it's easy to imagine a compiler failing to optimize a piece of code as well on something architecturally novel. RISC-V and ARM aren't actually so dissimilar at a high level that completely different optimizations need to be written and selectively weighted by architecture, but I imagine something like a Mill CPU would require quite a reimagining to get anything approaching optimal performance.

  • I read somewhere that since floating-point addition is not associative, the compiler will not autovectorize because the order might change.
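    That's right for reductions: reordering float adds can change the result, so by default neither gcc nor clang will vectorize a loop like this one (a sketch of mine) without -ffast-math or -fassociative-math, which explicitly permit the reordering:

      #include <stddef.h>

      /* Not vectorized at -O2/-O3 by default: a vector sum would add the
         elements in a different order than the source specifies. */
      float sum(const float *a, size_t n) {
          float s = 0.0f;
          for (size_t i = 0; i < n; i++)
              s += a[i];
          return s;
      }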

It's something that the compiler would handle, but it can still moderately influence programming decisions: you can have a lot more temporary variables before things start slowing down due to spill stores/loads, especially in, say, a loop with function calls, since more registers also means more non-volatile registers (those guaranteed not to change across function calls). But, yes, very limited impact even then.
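A contrived sketch of that effect (the function names are made up): every float temporary live across the call to g() must survive it somehow. The x86-64 SysV ABI has no callee-saved vector/float registers at all, so these get spilled to the stack around every call, while RISC-V's LP64D ABI has fs0-fs11, which can hold them across calls with no spill.

    extern float g(float);

    float f(const float *a, int n) {
        /* four accumulators live across every call to g() */
        float t0 = 0, t1 = 0, t2 = 0, t3 = 0;
        for (int i = 0; i < n; i++) {
            float x = g(a[i]);  /* the call clobbers caller-saved registers */
            t0 += x; t1 += x * x; t2 += x + 1.0f; t3 += x * 0.5f;
        }
        return t0 + t1 + t2 + t3;
    }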

  • It's certainly something I would take into consideration when making a (language) runtime, but probably not at all for all but the most performance-sensitive applications. Certainly a difference, but far lower-level than what most applications require.

    • Yep. Unfortunately I am one to be making language runtimes :)

      It's just potentially the most significant thing I could come up with at first. Though perhaps RVV not being in RVA20/rv64gc is more significant.
