Comment by scottlamb
5 days ago
Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!
They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.
I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I don’t know much about raytracing but it’s probably tricky to orchestrate all those asin calls so that the input and output memory is aligned and contiguous. My uneducated intuition is that there’s little regularity as to which pixels will take which branches and will end up requiring which asin calls, but I might be wrong.
I'd expect it to come down to data-oriented design: SoA (structure of arrays) rather than AoS (array of structures).
I skimmed the author's source code, and this is where I'd start: https://github.com/define-private-public/PSRayTracing/blob/8...
Instead of an `_objects`, I might try for a `_spheres`, `_boxes`, etc. (Or just `_lists` still using the virtual dispatch but for each list, rather than each object.) The `asin` seems to be used just for spheres. Within my `Spheres::closest_hit` (note plural), I'd work to SIMDify it. (I'd try to SIMDify the others too of course but apparently not with `asin`.) I think it's doable: https://github.com/define-private-public/PSRayTracing/blob/8...
I don't know much about ray tracers either (having only written a super-naive one back in college) but this is the general technique used to speed up games, I believe. Besides enabling SIMD, it's more cache-efficient and minimizes dispatch overhead.
edit: there's also stuff that you can hoist in this impl. Restructuring as SoA isn't strictly necessary to do that, but it might make it more obvious and natural. As an example, this `ray_dir.length_squared()` is the same for the whole list. You'd notice that when iterating over the spheres. https://github.com/define-private-public/PSRayTracing/blob/8...
It comes down to how "coherent" the rays are, and how much effort (compute) you want to put into sorting them into batches of rays.
With "primary" ray-tracing (i.e. camera rays, rays from surfaces to area lights), it's quite easy to batch them up and run SIMD operations on them.
But once you start doing global illumination, with rays bouncing off surfaces in all directions (and with complex materials, with multiple BSDF lobes, where lobes can be chosen stochastically), you start having to put a LOT of effort into sorting and batching rays such that they all (within a batch) hit the same objects or are going in roughly the same direction.
When I was working on this project, I was trying to restrict myself to the architecture of the original Ray Tracing in One Weekend book series. I am aware that things are not as SIMD friendly and that becomes a major bottle neck. While I am confident that an architectural change could yield a massive performance boost, it's something I don't want to spend my time on.
I think it's also more fun sometimes to take existing systems and to try to optimize them given whatever constraints exist. I've had to do that a lot in my day job already.
1 reply →
This tracks with my experience and seems reasonable, yes. I tend to SoA all the things, sometimes to my coworkers’ amusement/annoyance.
I don't do much float work but I don't think there is a single regular sine instruction only old x87 float stack ones.
I was curious what "sequence" would end up being but my compiler is too old for that intrinsic. Even godbolt didn't help for gcc or clang but it did reveal that icc produced a call https://godbolt.org/z/a3EsKK4aY
If you click libraries on godbolt, it's pulling in a bunch, including multiple SIMD libraries. You might have to fiddle with the libraries or build locally.
The issue is that the algorithm is only half the story. The implementation (e.g. bytecode) is the other.
I've been trying to find ways to make the original graphics renderer of the CGA version of Elite faster as there have been dozens of little optimizations found over the decades since it was written.
I was buoyed by a video of Super Mario 64/Zelda optimizations where it was pointed out that sometimes an approx calculation of a trig value can be quicker than a table lookup depending on the architecture.
Based on that I had conversations with LLMs over what fast trig algorithms there are, but for 8088 you are cooked most of the time on implementing them at speed.