Comment by Dylan16807

5 months ago

I'm not imagining toy sizes. Quite the opposite. I'm saying that layers are so big that splitting per neuron already gives you a ton of individual calculations to schedule, and that's plenty to keep the hardware fully occupied.
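
To put rough numbers on that (the 4096-wide layer and the 108-SM, A100-class card are assumed example figures, not anything from this thread), the back-of-envelope looks like:

    // Back-of-envelope only; layer width and SM count are illustrative assumptions.
    #include <cstdio>

    int main() {
        const long long neurons = 4096; // one output neuron = one independent dot product
        const long long dim     = 4096; // inputs feeding each neuron
        const long long sms     = 108;  // streaming multiprocessors on the card
        printf("%lld independent dot products, %lld multiply-adds, ~%lld neurons per SM\n",
               neurons, neurons * dim, neurons / sms);
        return 0;
    }

Thousands of independent dot products against ~100 SMs: there's no shortage of schedulable work even before you split inside a neuron.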

You can do very wide calculations on a single neuron if you want; throwing an entire SM (64 or 128 CUDA cores) at a single neuron is trivial to do in a deterministic way. And if a calculation is so big that you benefit from splitting it across SMs, doing a deterministic sum of the per-SM partials at the end will take an unmeasurably small fraction of your runtime.
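
Here's a minimal sketch of what I mean: one thread block (roughly one SM) per output neuron, with a fixed-order shared-memory reduction. The sizes and names are my own illustrative choices, not code from any real library:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // One block per output neuron; launch with blockDim.x == 256.
    // The reduction order depends only on thread indices, never on timing or
    // on which block finishes first, so the same card gives the same bits
    // on every run.
    __global__ void neuron_forward(const float* __restrict__ w,  // [neurons][dim] weights
                                   const float* __restrict__ x,  // [dim] input
                                   float* __restrict__ y,        // [neurons] output
                                   int dim)
    {
        const float* wrow = w + (size_t)blockIdx.x * dim;

        // Each thread accumulates a strided slice of the dot product.
        float partial = 0.0f;
        for (int i = threadIdx.x; i < dim; i += blockDim.x)
            partial += wrow[i] * x[i];

        // Fixed-shape tree reduction in shared memory.
        __shared__ float buf[256];
        buf[threadIdx.x] = partial;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            y[blockIdx.x] = buf[0];
    }

    int main() {
        const int neurons = 1024, dim = 4096;
        std::vector<float> hw((size_t)neurons * dim, 0.001f), hx(dim, 1.0f), hy(neurons);
        float *dw, *dx, *dy;
        cudaMalloc(&dw, hw.size() * sizeof(float));
        cudaMalloc(&dx, hx.size() * sizeof(float));
        cudaMalloc(&dy, hy.size() * sizeof(float));
        cudaMemcpy(dw, hw.data(), hw.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dx, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);
        neuron_forward<<<neurons, 256>>>(dw, dx, dy, dim);
        cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);  // identical bits every run on the same card
        cudaFree(dw); cudaFree(dx); cudaFree(dy);
        return 0;
    }

And if one neuron were genuinely big enough to want several SMs, each block would write its partial to a scratch slot indexed by block ID and a tiny second pass would add those slots in a fixed order; that final pass is a handful of additions per neuron, which is the "unmeasurably small" cost I'm talking about.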

And I'll remind you that I wasn't even talking about determinism across architectures, just within an architecture, so go ahead and optimize your memory layouts and block sizes for your exact card.