Comment by Dylan16807
17 days ago
Each thread on a CPU will go in the same order.
Why would the reduction step of a single neuron be split across multiple threads? That sounds slower and more complex than the naive method. And if you do decide to write code doing that, then only the code that reduces across multiple blocks needs to use integers (sketched below), so pretty much no extra effort is needed.
Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
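To make the "use integers across blocks" part concrete, here is a minimal sketch, assuming CUDA, a hypothetical kernel name, and a made-up fixed-point scale: per-block float partials get combined with 64-bit integer atomics, where the addition order genuinely can't change the result. Overflow handling and scale selection are glossed over.

    // Hypothetical sketch: each block has already reduced its chunk to a float
    // partial in some fixed per-block order; only this cross-block combine needs
    // to be order-independent, and integer addition is exactly associative.
    __global__ void combine_partials_fixed_point(const float* block_partials,
                                                 int num_blocks,
                                                 unsigned long long* acc) {
        const double SCALE = 1048576.0;  // assumed fixed-point scale; overflow not handled
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_blocks) {
            long long q = (long long)((double)block_partials[i] * SCALE);
            atomicAdd(acc, (unsigned long long)q);  // two's complement wrap handles negatives
        }
    }
    // Read back: (double)(long long)(*acc) / SCALE gives a bit-identical result across runs.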
> Each thread on a CPU will go in the same order.
Not unless you control the underlying scheduler and force a deterministic order; knowing all the code that's running isn't sufficient, because some of the factors affecting thread ordering are tied to the physical environment. For example, minute differences in temperature gradients on the chip between two runs could affect how threads are allocated to CPU cores and the order in which they finish.
> Why would the reduction step of a single neuron be split across multiple threads?
Doesn't have to, but can, depending on how many inputs it has. Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).
> Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
No. There's just no dot-product instruction baked into the GPU at a low level that could handle vectors of arbitrary length. You need to write a loop, and that usually becomes some kind of parallel reduce.
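For illustration, a minimal sketch of the kind of parallel reduce that loop usually turns into (a hypothetical kernel, assuming CUDA and a block size of 256): a shared-memory tree reduction per block. The additions are paired up by the tree rather than left to right, which only gives "the same answer" if you treat float addition as associative and commutative.

    __global__ void dot_partial(const float* a, const float* b, int n, float* partial) {
        __shared__ float s[256];                        // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;  // one product per thread
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];  // tree-order pairing of sums
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = s[0];                 // one partial sum per block
    }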
> could affect how threads are allocated to CPU cores and the order in which they finish
I'm very confused by how you're interpreting the word "each" here.
> Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).
Splitting up a single neuron seems like something that would only increase overhead. Can you please explain how you get "a lot" of flexibility?
> You need to write a loop, and that usually becomes some kind of parallel reduce.
Processing a layer is a loop within a loop.
The outer loop is across neurons and needs to be parallel.
The inner loop processes every weight for a single neuron, and making it parallel sounds like extra effort just to increase the instruction count, hurt memory locality, and make your numbers less consistent.
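Roughly this shape, to be concrete (a hypothetical CUDA kernel, assuming row-major weights of shape [neurons x inputs]): one thread per output neuron, with a plain sequential inner loop, so each neuron's sum is accumulated in the same order on every run.

    __global__ void layer_forward_naive(const float* w, const float* x, float* y,
                                        int neurons, int inputs) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // outer loop: one thread per neuron
        if (j < neurons) {
            float acc = 0.0f;
            for (int k = 0; k < inputs; ++k)            // inner loop: fixed sequential order
                acc += w[j * inputs + k] * x[k];
            y[j] = acc;                                 // same rounding every run
        }
    }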
I feel like you're imagining a toy network with a couple dozen neurons in a few layers, run on a CPU. But consider the more typical case of dozens of layers with hundreds (or thousands) of neurons each. That's on the order of a thousand numbers to reduce per neuron.
Then remember that GPUs are built around thousands of tiny parallel processors, each able to process a bunch of parallel threads (e.g. 16), but those threads have to run in larger, SIMD-like batches, and there's a complex memory management architecture built in, over which you only have so much control. The specific numbers of cores, threads, and buffer sizes, as well as the access patterns, differ between GPU models, and for optimal performance you have to break down your computation to maximize utilization. Or rather, have the runtime do it for you.
This ain't an FPGA; you don't get to organize the hardware to match your network. If you have 1000 neurons per hidden layer, then individual neurons likely won't fit on a single CUDA core, so you will have to split them down the middle, at least if you're using full-float math. Speaking of which, the precision of the numbers you use is another parameter that adds to the complexity.
On the one hand, you have a bunch of mostly-linear matrix algebra, where you can tune the precision. On the other hand, you have a GPU-model-specific number of parallel processors (~thousands), each of which can fit only so much memory and can run a specific number of SIMD-like threads in parallel; most of those numbers are powers of two (or multiples thereof), so you also have alignment to take into account, on top of memory access patterns.
By default, your network will in no way align to any of that.
It shouldn't be hard to see that assuming commutativity gives you (or rather the CUDA compiler) much more flexibility to parallelize your calculations by splitting them up whichever way it likes to maximize utilization.
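For contrast with the one-thread-per-neuron version above, a minimal sketch (hypothetical kernel, assuming CUDA) of a split reduction: a grid-stride loop plus a float atomic for the cross-block combine. Which terms end up grouped together depends on the launch configuration and on scheduling, which is exactly the freedom that treating float addition as commutative/associative buys, and also why the rounding can differ from run to run.

    __global__ void dot_split(const float* a, const float* b, int n, float* out) {
        float acc = 0.0f;
        // Grid-stride loop: which elements each thread sums depends on the launch config.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            acc += a[i] * b[i];
        atomicAdd(out, acc);  // cross-block combine; arrival order is not fixed
                              // (*out must be zero-initialized before launch)
    }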