Comment by ryao
2 months ago
> As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.
Inference works by computing a layer and then sending a very small vector as input to the next layer. When a model does not fit in a single GPU, you just divide it into groups of layers and send that vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time before the next layer can be computed. The fastest inference on the planet, at Cerebras, uses this technique to do 2,500 T/sec on Llama 4 Maverick.
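To make that concrete, here is a minimal single-process sketch (a NumPy stand-in with made-up sizes, not any particular framework's real API) of splitting a model by layers: each device keeps its own weights, and the only thing crossing a device boundary is one hidden-size vector.

```python
# Minimal sketch (NumPy, single process, made-up sizes) of splitting a model
# by layers across devices: each "device" keeps its own layer weights, and
# the only thing crossing a device boundary is one hidden-size vector.
import numpy as np

hidden = 4096                      # hidden size, illustrative only
layers_per_device = 4
num_devices = 2

rng = np.random.default_rng(0)
# Each device holds the weights for its own layers; the big data never moves.
devices = [
    [rng.standard_normal((hidden, hidden)) * 0.01 for _ in range(layers_per_device)]
    for _ in range(num_devices)
]

def run_device(weights, x):
    # Compute this device's layers locally.
    for w in weights:
        x = np.tanh(x @ w)         # stand-in for a real transformer layer
    return x

x = rng.standard_normal(hidden)    # activation for one token
for dev_id, weights in enumerate(devices):
    x = run_device(weights, x)
    # In a real system this is the only cross-device traffic: a single
    # hidden-size vector (a few KB at fp16) sent over the fabric to the
    # device holding the next group of layers.
    print(f"device {dev_id} -> next: {x.nbytes} bytes")
```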
> Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.
>
> https://x.com/swyx/status/1760065636410274162?s=46
There is nothing big about Groq’s chips. Their individual chips have only 230 MB of SRAM. Unlike Cerebras, which can load multiple layers into a single chip, Groq must divide a layer across many chips.
> Distributing inference per layer, instead of splitting each layer across GPUs, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 GPU at a time), so inference is slower. In addition, the orchestration of starting the next batch on GPU #0 while GPU #1 starts on the previous one is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.
In what software? llama.cpp and others divide things by layers.
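For contrast, here is a minimal sketch (again a NumPy stand-in with illustrative sizes) of the tensor parallelism described in the parent comment, where each layer's weight matrix is sharded across GPUs and the partial outputs have to be gathered every layer.

```python
# Minimal sketch (NumPy, single process, made-up sizes) of the tensor
# parallelism described above: one layer's weight matrix is split column-wise
# across "GPUs", every GPU computes a partial output for the same token, and
# the partials are gathered back together (an all-gather over NVLink in a
# real system) before the next layer can run.
import numpy as np

hidden, num_gpus = 4096, 4
rng = np.random.default_rng(0)
full_weight = rng.standard_normal((hidden, hidden)) * 0.01

# Column-wise shards of the layer's weight matrix, one per GPU.
shards = np.split(full_weight, num_gpus, axis=1)

x = rng.standard_normal(hidden)             # the same input goes to every GPU
partials = [x @ shard for shard in shards]  # each GPU does 1/num_gpus of the matmul
y = np.concatenate(partials)                # the collective step, every layer, every token

# The sharded result matches the unsharded matmul; the cost is a collective
# per layer per token, which is why tensor parallelism wants a fast
# intra-node fabric, while the layer split above communicates far less often.
assert np.allclose(y, x @ full_weight)
```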