Comment by Validark
2 months ago
Classic Matu3ba, at it again, coaxing me down new rabbit holes. Hahaha. I hope you know I became a SWAR wizard for you.
That video is very interesting to me. I don't yet know enough about all those languages to speak intelligently on how they could be used to simplify my code, but I do agree it would be amazing if we could generalize my techniques so that every project could easily achieve what I have through a lot of hard work.
I'm not sure if the specific thing demoed in that video is helpful for my use case. That workload, as well as the other examples on the site, seems to get a ton of mileage because the data is two-dimensional, so there are a lot of different choices that could be made about the order in which things are done. Being able to switch those orderings quickly to test how well vectorization works for each choice is a major benefit for those examples.
The tokenizer, on the other hand, is one-dimensional. The best way to traverse a source file is obvious: just go forward.
Automatic scaling to bigger or smaller vectors could be made easier, though. At the moment, the way I'd want that achieved is for LLVM to have:
1. u512 operations happening in AVX-512 vectors
2. ARM backend support for automatically preferring interleaved vectors over regularly-ordered vectors. Maybe even better would be a type at the language level.
3. A SWAR emitter for LLVM vectors, for machines without vector support
4. Movemask support in the PowerPC, WASM, and MIPS backends. I don't want to write intrinsics for every little bitcast. (A hand-rolled SWAR version of movemask is sketched below.)
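To make point 4 (and really point 3) concrete, here is roughly what a hand-rolled SWAR fallback for movemask looks like when the backend won't emit one for you. This is just an illustrative sketch with a made-up function name, not my actual tokenizer code and not what LLVM emits:

```zig
const std = @import("std");

/// Scalar fallback for a byte-lane movemask: given a u64 whose bytes are each
/// 0x80 or 0x00 (e.g. the result of a SWAR byte comparison), pack the eight
/// per-byte flags into the low 8 bits, with bit i corresponding to byte i.
fn movemaskSwar(x: u64) u8 {
    const lane_lsb: u64 = 0x0101010101010101;
    const gather: u64 = 0x0102040810204080;
    // Move each byte's high bit down to its low bit, then use a wrapping
    // (carry-free) multiply to collect all eight bits into the top byte.
    return @truncate((((x >> 7) & lane_lsb) *% gather) >> 56);
}

test "movemaskSwar packs one bit per byte" {
    try std.testing.expectEqual(@as(u8, 0b1000_0001), movemaskSwar(0x8000000000000080));
    try std.testing.expectEqual(@as(u8, 0x00), movemaskSwar(0));
    try std.testing.expectEqual(@as(u8, 0xFF), movemaskSwar(0x8080808080808080));
}
```

The multiply is the usual carry-free gather: each lane bit lands in a distinct spot of the top byte, so one wrapping multiply plus a shift stands in for the movemask instruction. I'd much rather the backend know this than have to write it per target.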
I'm also just skeptical in general that offloading this to the compiler is a good idea with the current state of the technology. There are just too many routines it doesn't know about, it misses canonicalizations too often, it doesn't seem to be architected to handle several register files well, it doesn't seem to be architected to support bit manipulation efficiently, and the fact that SWAR support still hasn't arrived at this point in history doesn't inspire confidence either.
In cases like the one the video shows, where you're just coercing the compiler into doing basic auto-vectorization, I'm sure that works perfectly.
But in this work, my hunch is that the optimal solution could not be trivially written in a nice functional style. This code does not use the "easy solution" of being able to reach forwards and backwards for data. That is, the easy solution, expressed as pattern matching over `{ file[p-1], file[p], file[p+1], ... }`, imposes (I believe) a constraint that we have to read in the next vector and produce its bitstrings before we can finish processing the current vector. For simple languages this is fine, but for something like Zig we have to do a lot of work and there just aren't enough registers to go around. That's why my solution uses pseudo-starts (putting a fake start position at the beginning of a chunk due to carried-over values) and a more manual carry-over system. The logic here is NOT position-independent, solely for performance reasons. Do I think the transformation from position-agnostic logic to the kind of logic I'm talking about could be done automatically? Maybe. But my suspicion is that a parser generator makes more sense as the sort of tool to solve this.
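To give a rough idea of what I mean by carry-over, here is a deliberately simplified sketch (made-up names, nothing like the real pseudo-start machinery): say each 64-byte chunk already has an `ident` bitstring with a 1 for every identifier byte. Finding the identifier starts in that chunk then only needs one bit carried over from the previous chunk, rather than reaching backwards into the file:

```zig
/// Simplified sketch: "start of identifier" bits for one 64-byte chunk.
/// `ident` has a 1 for every identifier byte in the chunk; `carry` holds the
/// last `ident` bit of the previous chunk, so byte 0 of this chunk only counts
/// as a start when the previous chunk did not end inside an identifier.
/// No reads into the previous or next chunk are required.
fn identStarts(ident: u64, carry: *u64) u64 {
    const prev = (ident << 1) | carry.*; // predecessor bit for every position
    carry.* = ident >> 63;               // save this chunk's last bit for next time
    return ident & ~prev;                // 1s only where a run of identifier bytes begins
}
```

The real thing has to juggle far more bitstrings (and the pseudo-starts that carried state forces onto the front of a chunk), but the basic shape is the same: state flows forward one chunk at a time instead of the logic peeking at neighboring vectors.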