Comment by exDM69

7 hours ago

I started writing a software triangle rasterizer. Not sure exactly why, but I just felt like doing it.

Like many software rasterizer projects I used Fabian Giesen's software rasterizer blog series [0] as the baseline.
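For readers who haven't seen the series: the core idea is to evaluate three edge functions over the triangle's bounding box and step them incrementally, so each pixel costs only additions. The sketch below is a simplified illustration of that approach. It is not the commenter's code, and it makes some assumptions: integer sample points, no pixel-center offset, no top-left fill rule, and a triangle with positive orientation. The names `Point`, `orient2d` and `rasterize` are hypothetical.

```rust
#[derive(Clone, Copy)]
struct Point {
    x: i32,
    y: i32,
}

/// Twice the signed area of triangle (a, b, c); positive for counter-clockwise
/// (a, b, c) in a y-up frame.
fn orient2d(a: Point, b: Point, c: Point) -> i32 {
    (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
}

/// Calls `plot(x, y)` for every integer sample inside or on the triangle.
/// Assumes orient2d(v0, v1, v2) > 0; no top-left rule, no pixel-center offset.
fn rasterize<F: FnMut(i32, i32)>(v0: Point, v1: Point, v2: Point, mut plot: F) {
    // Bounding box of the triangle.
    let min_x = v0.x.min(v1.x).min(v2.x);
    let min_y = v0.y.min(v1.y).min(v2.y);
    let max_x = v0.x.max(v1.x).max(v2.x);
    let max_y = v0.y.max(v1.y).max(v2.y);

    // Per-edge increments: stepping x by 1 adds A, stepping y by 1 adds B.
    let (a01, b01) = (v0.y - v1.y, v1.x - v0.x);
    let (a12, b12) = (v1.y - v2.y, v2.x - v1.x);
    let (a20, b20) = (v2.y - v0.y, v0.x - v2.x);

    // Edge function values at the top-left corner of the bounding box.
    let p = Point { x: min_x, y: min_y };
    let mut w0_row = orient2d(v1, v2, p);
    let mut w1_row = orient2d(v2, v0, p);
    let mut w2_row = orient2d(v0, v1, p);

    for y in min_y..=max_y {
        let (mut w0, mut w1, mut w2) = (w0_row, w1_row, w2_row);
        for x in min_x..=max_x {
            // OR of the three values is non-negative only if all sign bits are clear.
            if (w0 | w1 | w2) >= 0 {
                plot(x, y);
            }
            w0 += a12;
            w1 += a20;
            w2 += a01;
        }
        w0_row += b12;
        w1_row += b20;
        w2_row += b01;
    }
}

fn main() {
    let mut covered = 0;
    rasterize(
        Point { x: 0, y: 0 },
        Point { x: 4, y: 0 },
        Point { x: 0, y: 4 },
        |_, _| covered += 1,
    );
    println!("covered samples: {covered}");
}
```

The SIMD versions in the series evaluate the same edge functions for a block of pixels at once instead of one `x` at a time.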

For solid-color triangles with depth testing, my 10+ year old laptop achieves a fill rate of ~3.2 Gpixels/s (~26 GiB/s at 64 bits per pixel), which is above 80% of the available memory bandwidth, using memset as the baseline comparison.
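The bandwidth figure follows from fill rate × bytes per pixel. A quick sanity check of that arithmetic, assuming (as the comment states) 64 bits per pixel, which this sketch takes to be a 32-bit color write plus a 32-bit depth write; `fill_bytes_per_sec` is a hypothetical helper. With decimal prefixes, 3.2 G × 8 bytes is 25.6 GB/s (about 23.8 GiB/s), so whether "~26" is GB or GiB depends on whether the G means 10^9 or 2^30. The memset bandwidth itself isn't given; "above 80%" only implies it is at most ~32 GB/s.

```rust
/// Bytes written per second for a given fill rate and framebuffer footprint per pixel.
fn fill_bytes_per_sec(pixels_per_sec: u64, bits_per_pixel: u64) -> u64 {
    pixels_per_sec * bits_per_pixel / 8
}

fn main() {
    // 64 bits per pixel: assumed to be 32-bit color + 32-bit depth, counting writes only.
    let bytes = fill_bytes_per_sec(3_200_000_000, 64);
    println!(
        "{:.1} GB/s ({:.1} GiB/s)",
        bytes as f64 / 1e9,
        bytes as f64 / (1u64 << 30) as f64
    );
    // If this is at least 80% of memset bandwidth, memset tops out at bytes / 0.8.
    println!("implied memset ceiling: ~{:.0} GB/s", bytes as f64 / 0.8 / 1e9);
}
```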

I used Rust and std::simd. The code can be compiled for SSE2, NEON, AVX2 or AVX-512 by changing compiler parameters. The inner loop uses 16-wide vectors (512 bits) even though my computer only has 8-wide AVX2; the compiler deals with that by splitting each operation in two. I used generics so I can change the vector width easily and have multiple vector widths in the same binary for benchmarking. 16-wide is about 10-20% faster than 8-wide. I was excited to see that AVX masked store instructions get used even though I did not explicitly write masked stores in the code. I spent a lot of time reading the disassembly of the generated code, and it's very tight.
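Being generic over vector width can be sketched with a const generic lane count. std::simd (the `portable_simd` feature) is nightly-only at the time of writing, so this sketch uses plain fixed-size arrays on stable Rust and leaves vectorization to LLVM; with std::simd the same shape would use `Simd<f32, N>` and a mask. It is not the commenter's code. `fill_span` is a hypothetical name, and "smaller depth = closer" is an assumed convention. The per-lane conditional store is the kind of pattern a compiler may lower to masked stores when targeting AVX2 or AVX-512, but whether it does depends on target flags and is not guaranteed.

```rust
/// Depth-tests a run of N pixels and writes a solid color plus depth where the test
/// passes. N is a const generic, so 8- and 16-wide versions coexist in one binary.
fn fill_span<const N: usize>(
    depth_in: &[f32; N],
    coverage: &[bool; N],
    color: u32,
    depth_buf: &mut [f32; N],
    color_buf: &mut [u32; N],
) {
    for i in 0..N {
        // Branch-free lane mask: covered by the triangle AND closer than the stored depth.
        let pass = coverage[i] & (depth_in[i] < depth_buf[i]);
        // Per-lane conditional store; a vectorizer may turn this into a masked store.
        if pass {
            depth_buf[i] = depth_in[i];
            color_buf[i] = color;
        }
    }
}

fn main() {
    // Same function, two widths, e.g. for side-by-side benchmarking.
    let mut depth8 = [1.0f32; 8];
    let mut color8 = [0u32; 8];
    fill_span::<8>(&[0.5; 8], &[true; 8], 0xffff_ffff, &mut depth8, &mut color8);

    let mut depth16 = [1.0f32; 16];
    let mut color16 = [0u32; 16];
    fill_span::<16>(&[0.5; 16], &[true; 16], 0xffff_ffff, &mut depth16, &mut color16);

    println!("{:?} {:?}", &color8[..2], &color16[..2]);
}
```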

The performance falls off a cliff (170 Mpixels/s) once I introduce a "shader" in the inner loop, because it is not SIMD-friendly (one pixel at a time, not 16). But that is fine: I intend to use this for visibility buffer style rendering (storing integer triangle IDs in the color buffer) and/or software occlusion culling (depth buffer only). Neither technique needs anything more than solid colors and a z-buffer.
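Both use cases reduce to "write a per-triangle constant with a depth test", which is why they keep the fast solid-color path. A minimal sketch of the two buffers, assuming "smaller depth = closer" and `u32::MAX` as an empty-pixel sentinel; `VisibilityBuffer`, `write`, `id_at` and `may_be_visible` are hypothetical names. The occlusion query here is a simplified conservative rectangle test; real occlusion culling usually tests a rasterized bounding volume rather than a screen-space rectangle.

```rust
/// Sentinel meaning "no triangle covers this pixel".
const NO_TRIANGLE: u32 = u32::MAX;

struct VisibilityBuffer {
    width: usize,
    depth: Vec<f32>,
    tri_id: Vec<u32>,
}

impl VisibilityBuffer {
    fn new(width: usize, height: usize) -> Self {
        let n = width * height;
        // Depth cleared to +infinity so that any real fragment passes the first test.
        Self { width, depth: vec![f32::INFINITY; n], tri_id: vec![NO_TRIANGLE; n] }
    }

    /// Depth-tested write of a triangle ID: the "shader" is a per-triangle constant,
    /// just like a solid color.
    fn write(&mut self, x: usize, y: usize, z: f32, id: u32) {
        let i = y * self.width + x;
        if z < self.depth[i] {
            self.depth[i] = z;
            self.tri_id[i] = id;
        }
    }

    /// Triangle visible at (x, y), if any; a later pass would fetch its attributes.
    fn id_at(&self, x: usize, y: usize) -> Option<u32> {
        let id = self.tri_id[y * self.width + x];
        (id != NO_TRIANGLE).then_some(id)
    }

    /// Conservative occlusion query using only the depth buffer: an object covering
    /// [x0, x1) x [y0, y1) whose nearest depth is `z_near` may be visible if any pixel
    /// there holds something farther away.
    fn may_be_visible(&self, x0: usize, y0: usize, x1: usize, y1: usize, z_near: f32) -> bool {
        (y0..y1).any(|y| (x0..x1).any(|x| self.depth[y * self.width + x] > z_near))
    }
}

fn main() {
    let mut vb = VisibilityBuffer::new(4, 4);
    vb.write(1, 1, 0.5, 7);
    vb.write(1, 1, 0.8, 9); // farther than triangle 7, rejected by the depth test
    println!("id at (1,1): {:?}", vb.id_at(1, 1));
    println!("occludee behind it visible? {}", vb.may_be_visible(1, 1, 2, 2, 0.6));
}
```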

[0] https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlu...