Comment by logdahl
8 months ago
Well, the core issue is still drawing. I took another look at some profiles and it seems like it's not the renderer limiting this to 27k! I still had some stupid scene-graph traversal... Clustering and culling take 53us and 33us respectively, but the draw is 7ms. So a frame is about 7ms on the GPU side, and some 100-200us on the CPU side.
Should really dive deeper and update the measurements for final results...
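To put rough numbers on that split, here is a back-of-the-envelope sketch. It assumes the three passes run back to back on the GPU and that "7ms" means exactly 7000us; the constant and function names are made up for illustration and are not from the actual renderer.

```cpp
// Timings quoted above, in microseconds (rounded figures, not fresh measurements).
constexpr double kClusterUs = 53.0;
constexpr double kCullUs    = 33.0;
constexpr double kDrawUs    = 7000.0;  // assumption: "7ms" taken as exactly 7000us

// Total GPU-side frame time if the passes do not overlap (assumption).
constexpr double gpu_frame_us() { return kClusterUs + kCullUs + kDrawUs; }

// Fraction of the GPU frame spent in the draw pass.
constexpr double draw_share() { return kDrawUs / gpu_frame_us(); }
```

Under those assumptions the draw pass is a bit under 99% of the GPU frame, so clustering and culling are basically noise next to it.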
I haven't looked at the post in the detail it deserves, but given your graphs the workload looks pretty bursty. I'd suspect there are some good I/O optimizations or some predication to be had. That last void main block definitely looks ripe for that. But I'd listen to Knuth, premature optimization and all, so grab a profiler. I wouldn't be surprised if you're nearing peak performance. Also, NVIDIA GPUs have a lot of special tricks that can be exploited but are buried in documentation... if you haven't already seen it (I suspect you have), you'd be interested in "GPU Gems". Gems 2 has some good stuff on predication.
But also, really good work! You should be proud of this! Squeezing that much out of that hardware is no easy feat.
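For anyone unsure what predication means here: the idea is to compute both outcomes and pick one with a mask instead of jumping. A minimal CPU-side sketch in C++ follows. It is not the shader from the post, and the function names are hypothetical. Compilers and GPU drivers often do this on their own, so it only matters if a profile shows a divergent branch actually hurting.

```cpp
#include <cstdint>

// Branchy form: may compile to a real jump, which can diverge across GPU lanes.
constexpr std::uint32_t select_branchy(bool cond, std::uint32_t a, std::uint32_t b) {
    return cond ? a : b;
}

// Predicated form: build an all-ones or all-zeros mask from the condition
// and blend the two values with it, so there is no control flow at all.
constexpr std::uint32_t select_predicated(bool cond, std::uint32_t a, std::uint32_t b) {
    // 0u - 1u wraps to 0xFFFFFFFF when cond is true; 0u - 0u stays 0 when false.
    return (a & (0u - static_cast<std::uint32_t>(cond))) |
           (b & ~(0u - static_cast<std::uint32_t>(cond)));
}
```

Both functions return `a` when `cond` is true and `b` otherwise; the second one just always pays for both operands.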