Comment by neonsunset
1 month ago
Just scalar code? I was hoping to see some Goasm here for acceptable performance (or you could rewrite it in F#/C# which provide appropriate SIMD primitives).
Edit: to answer my own question, when inspected with Ghidra, this implementation does indeed compile to very slow scalar code (it operates on single fp64 values).
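For reference, the hot loops being discussed are essentially of this shape (a hypothetical sketch over float64 slices, not the repository's actual code), and this is the pattern that ends up as one-value-at-a-time scalar math in the disassembly:

```go
// dot is a plain scalar dot product; with the current gc compiler this
// executes one float64 multiply and add per iteration rather than
// packed SIMD instructions.
func dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}
```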
I just hope for a sufficiently smart compiler *shrug* (I'm pretty sure Go has some autovectorization).
Before jumping to another language, I'd suggest first examining the memory layout and access patterns.
The code there is written in a fairly auto-vectorizable way. But the actual capabilities of Go's compiler are very far from that, despite public expectation (and autovectorization is brittle; writing inference or training in a way that relies on it is the last thing you want). To put it in perspective, until 2021 Go always passed arguments on the stack on function calls. It has improved since then, but the overall design aims to make common scenarios fast (e.g. comparisons against string literals are unrolled); once you venture outside that, or an optimization requires more compiler complexity, Go is far less likely to employ it.
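To make the "don't rely on autovectorization" point concrete, the usual workaround is manual unrolling with several independent accumulators; this is a generic sketch (dotUnrolled is not from the repository), and it helps even when the compiler keeps everything scalar, because the multiply-adds can overlap in the CPU pipeline:

```go
// dotUnrolled keeps four independent accumulators so the scalar
// multiply-adds are not serialized through a single dependency chain.
// No vectorization is required from the compiler for this to pay off.
func dotUnrolled(a, b []float64) float64 {
	var s0, s1, s2, s3 float64
	n := len(a)
	i := 0
	for ; i+4 <= n; i += 4 {
		s0 += a[i] * b[i]
		s1 += a[i+1] * b[i+1]
		s2 += a[i+2] * b[i+2]
		s3 += a[i+3] * b[i+3]
	}
	// Handle the remaining tail elements.
	for ; i < n; i++ {
		s0 += a[i] * b[i]
	}
	return s0 + s1 + s2 + s3
}
```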
> and autovectorization is brittle; writing inference or training in a way that relies on it is the last thing you want
I'm curious if you could speak more to this? Is the concern that operations may get reordered?
> To put it in perspective, until 2021 Go always passed arguments on the stack on function calls. It has improved since then, but the overall design aims to make common scenarios fast (e.g. comparisons against string literals are unrolled); once you venture outside that, or an optimization requires more compiler complexity, Go is far less likely to employ it.
I agree with this assessment.
The individual operations in the repository (e.g., dot product) look like they could be autovectorized. I'm assuming they aren't because of the use of a slice. I'm mildly curious if it could be massaged into something autovectorized.
Most of my observations re: autovectorization in Go have been on fixed-size vectors and matrices, where SSE2 instructions are pretty readily available and loop unrolling is pretty simple.
I'm curious what it would produce with the matrix in a single slice rather than independent allocations. Not curious enough to start poking at it, just curious enough to ramble about it conversationally.
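For the "matrix in a single slice" idea, the contrast would look roughly like this (hypothetical types, not the repository's): instead of a [][]float64 with one allocation per row, a flat row-major backing slice keeps the data contiguous, so the inner loops become plain strided accesses over one block of memory:

```go
// Mat stores an rows x cols matrix in a single row-major []float64:
// row i occupies data[i*cols : (i+1)*cols], and the whole matrix is one
// contiguous allocation instead of per-row allocations.
type Mat struct {
	rows, cols int
	data       []float64
}

func NewMat(rows, cols int) *Mat {
	return &Mat{rows: rows, cols: cols, data: make([]float64, rows*cols)}
}

func (m *Mat) At(i, j int) float64     { return m.data[i*m.cols+j] }
func (m *Mat) Set(i, j int, v float64) { m.data[i*m.cols+j] = v }

// Row returns row i as a subslice without copying, so kernels like the
// dot product above can run over contiguous memory.
func (m *Mat) Row(i int) []float64 { return m.data[i*m.cols : (i+1)*m.cols] }
```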