← Back to context

Comment by danybittel

5 days ago

How do you do the rendering? Is it sorted (radix?) instances? Do you amortize the sorting over a couple frames? Or use some bin sorting? Are you happy with the performance?

Yes, Spark does instanced rendering of quads, one covering each Gaussian splat. The sorting is done by 1) calculating sort distance for every splat on the GPU, 2) reading it back to the CPU as float16s, 3) doing a 1-pass bucket sort to get an ordering of all the splats from back to front.

On most newer devices the sorting can happen pretty much every frame with approx 1 frame latency, and runs in parallel on a Web Worker. So the sorting itself has minimal performance impact, and because of that Spark can do fully dynamic 3DGS where every splat can move independently each frame!

On some older Android devices it can be a few frames worth of latency, and in that case you could say it's amortized over a few frames. But since it all happens in parallel there's no real impact to the overall rendering performance. I expect for most devices the sorting in Spark is mostly a solved problem, especially with increasing memory bandwidth and shared CPU-GPU memory.

  • If you say 1 pass bucket sorting.. I assume you do sort the buckets as well?

    I've implemented a radix sort on GPU to sort the splats (every frame).. and I'm not quite happy with performance yet. A radix sort (+ prefix scan) is quite involved with lot's of dedicated hierarchical compute shaders.. I might have to get back to tune it.

    I might switch to float16s as well, I'm a bit hesitant, as 1 million+ splats, may exceed the precision of halfs.

    • We are purposefully trading off some sorting precision for speed with float16, and for scenes with large Z extents you'd probably get more Z-fighting, so I'm not sure if I'd recommend it for you if your goal is max reconstruction accuracy! But we'll likely add a 2-pass sort (i.e. radix sort with a large base / #buckets) in the future for higher precision (user selectable so you can decide what's more important for you). But I will say that implementing a sort on the CPU is much simpler than on the GPU, so it opens up possibilities if you're willing to do a readback from GPU to CPU and tolerate at least 1 frame of latency (usually not perceivable).

      2 replies →