Comment by nvme0n1p1

1 day ago

For your specific case, even a simple `[9][9]u16` might perform better (where you make use of nine bits in each u16). For each entry, the nine mask bits would be in the same bit positions, so the compiler won't have to do a bunch of shifts to extract/align the bits. CPUs love consistency. I doubt it's worth the additional codegen complexity to save 70 bytes in your data model.

My initial implementation used [9][9]u9 (which desugars to [9][9]u16 with some zero bits) and was a fair bit slower. If I had to guess, it's because the shift/extract/align you're describing isn't actually a part of the core solving algorithm, and when you have box constraints, knights-move constraints, etc, you're usually not doing anything which fits in a single u16.