Comment by ipieter
2 months ago
Distributing inference per layer, instead of splitting each layer across gpus, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 gpu at a time), so inference is slower. In addition, the orchestration of starting the next batch on gpu #0 while gpu #1 starts is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.
In what software? llama.cpp and others divide things by layers.