Comment by elcritch

5 hours ago

Running inference requires sharing intermediate matrix results between nodes. Faster networking speeds that up.

I read (though I can't find the source anymore) that the information sent from layer to layer is minimal. The actual matrix work happens within a layer. They are not doing matrix multiplication over the network (that would be insane latency-wise).
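To put rough numbers on that intuition (all model dimensions below are illustrative assumptions, not figures from any specific system), here's a back-of-the-envelope sketch comparing what crosses a pipeline boundary per token versus what stays local to a layer:

```python
# Sketch: per-token data crossing a pipeline-parallel boundary
# vs. compute done inside one transformer layer.
# Numbers are assumed for illustration (hidden size 8192, fp16).

hidden = 8192          # hidden dimension (assumption)
bytes_per_val = 2      # fp16

# What travels over the network between pipeline stages:
# just the activation vector for each token.
activation_bytes = hidden * bytes_per_val  # 16 KiB per token

# What stays local: the layer's weight matrices and their matmuls.
# A single feed-forward up-projection alone is hidden x 4*hidden.
ffn_weight_bytes = hidden * (4 * hidden) * bytes_per_val
flops_per_token = 2 * hidden * (4 * hidden)  # one matmul's FLOPs

print(f"sent over network per token: {activation_bytes / 1024:.0f} KiB")
print(f"one FFN weight matrix held locally: {ffn_weight_bytes / 2**20:.0f} MiB")
print(f"matmul FLOPs done locally per token: {flops_per_token:,}")
```

So the inter-node traffic is kilobytes per token, while the weights and the multiplications against them (hundreds of MiB, hundreds of MFLOPs per layer per token) never leave the node. That's why faster networking helps but isn't where the heavy lifting happens.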