Comment by Dylan16807
5 months ago
I'm not imagining toy sizes. Quite the opposite. I'm saying that layers are so big that splitting per neuron already gives you a ton of individual calculations to schedule, and that's plenty to get full usage out of the hardware.
You can do very wide calculations on a single neuron if you want; throwing an entire SM (64 or 128 CUDA cores) at a single neuron is trivial to do in a deterministic way. And if you have a calculation so big you benefit from splitting it across SMs, doing a deterministic sum at the end will use an unmeasurably small fraction of your runtime.
And I'll remind you that I wasn't even talking about determinism across architectures, just within an architecture, so go ahead and optimize your memory layouts and block sizes to your exact card.
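To make the "deterministic sum at the end" point concrete, here's a hypothetical sketch in NumPy standing in for a GPU kernel: each chunk plays the role of the partial sum one SM would produce, and the final reduction always adds the partials in a fixed index order rather than in whatever order the chunks happened to finish. The function name and chunking scheme are illustrative, not from the comment.

```python
import numpy as np

def deterministic_dot(x, w, n_chunks=8):
    # Split one neuron's dot product into fixed chunks. Each chunk is a
    # stand-in for the partial sum one SM would compute; the chunks could
    # finish in any order without affecting the result below.
    chunks_x = np.array_split(x, n_chunks)
    chunks_w = np.array_split(w, n_chunks)
    partials = [np.dot(cx, cw) for cx, cw in zip(chunks_x, chunks_w)]
    # Deterministic final reduction: accumulate partials in index order,
    # never in completion order, so the float rounding is bit-identical
    # run to run. This tiny loop is the "unmeasurably small" tail cost.
    total = np.float32(0.0)
    for p in partials:
        total += np.float32(p)
    return total
```

Because the accumulation order is fixed, calling this twice on the same inputs gives bit-identical float32 results, which is exactly the within-architecture determinism being claimed; a completion-order reduction (e.g. atomic adds) would not guarantee that.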