Comment by zozbot234

1 day ago

Wouldn't that be a fairly ideal setup for layer parallelism? Splitting the model by layers doesn't need the high-performance communication that tensor parallelism does, and the high-concurrency regime would make it easy to keep the pipeline full with microbatches. You'd also be able to scale out your KV cache storage, since that naturally splits layer-wise.
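To put rough numbers on both points, here is a back-of-the-envelope sketch. It assumes an idealized pipeline (equal time per stage, no communication cost, one forward pass per microbatch) and an even split of layers across stages. The function names and the model dimensions in the usage note are hypothetical, picked only for illustration.

```python
def pipeline_utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage time slots doing useful work in one pass.

    Idealized assumption: every stage takes one time slot per microbatch,
    so pushing M microbatches through S stages takes S + M - 1 slots.
    """
    total_slots = num_stages + num_microbatches - 1
    return num_microbatches / total_slots


def kv_cache_bytes_per_stage(
    num_layers: int,
    num_stages: int,
    kv_heads: int,
    head_dim: int,
    cached_tokens: int,
    bytes_per_elem: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """KV cache bytes held by one stage when layers are split evenly."""
    if num_layers % num_stages != 0:
        raise ValueError("this sketch assumes layers divide evenly across stages")
    # One K and one V vector per KV head, per cached token, per layer.
    per_layer = 2 * kv_heads * head_dim * cached_tokens * bytes_per_elem
    layers_per_stage = num_layers // num_stages
    return layers_per_stage * per_layer
```

With 4 stages, a single microbatch leaves the pipeline mostly idle (`pipeline_utilization(4, 1)` is 0.25), while 29 microbatches get it to 29/32, which is the sense in which high concurrency keeps the pipeline full. For the cache, a hypothetical 32-layer model split over 4 stages leaves each stage holding a quarter of the total KV bytes, because each layer's cache lives only on the stage that owns that layer.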
