Comment by hansvm

2 months ago

It works for a lot of model families but not all. You need a high enough degree of weight sharing between different queries for batching to pay off (memory bandwidth being the usual bottleneck nowadays, though smaller models see something similar from matmul batching efficiencies, for CPU-related reasons).
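To make the bandwidth point concrete, here's a back-of-the-envelope sketch with made-up numbers (model size and bandwidth are illustrative, not from any particular chip or deployment): each decode step has to stream the weights from memory, and that read is shared by every query in the batch, so the weight traffic per generated token falls as 1/batch_size.

```python
# Rough decode-throughput ceiling from weight streaming alone (illustrative
# numbers; ignores KV cache, activations, and compute limits entirely).

def weight_gb_per_token(weight_bytes, batch_size):
    # Every decode step streams all weights once; that read is shared by the
    # whole batch, so traffic per generated token falls as 1/batch_size.
    return weight_bytes / batch_size / 1e9

W = 140e9   # e.g. ~70B params at 2 bytes each -- illustrative
BW = 3e12   # ~3 TB/s of memory bandwidth -- illustrative

for b in (1, 8, 64):
    gb = weight_gb_per_token(W, b)
    print(f"batch {b:3d}: {gb:6.2f} GB of weight reads per token, "
          f"ceiling ~{BW / (gb * 1e9):7.0f} tokens/s total")
```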

Fully connected (dense) transformers trivially work (every weight is used for every query). MoE works beyond a certain size or with certain types of mixing (either still touching every weight, or touching a high enough fraction that there's meaningful sharing with batches of 20+ queries). As you push further in that direction, though (lots of techniques, but the key point being that each query touches less of the model at once and bypasses some of it), you need larger and larger batches for those efficiency gains to materialize. At some point that becomes untenable because of the latency of waiting for a batch to fill, and past that it becomes untenable because of the sheer volume of query data.
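As a rough illustration of why sparser routing needs bigger batches (again with made-up numbers, and a uniform-random routing assumption that real gating networks don't satisfy), compare the expected weight traffic per query for a dense model and an MoE that activates a fraction f of its weights per query:

```python
# Illustrative only: expected weight traffic per query, dense vs. sparse.
# Assumes uniform random routing, which real routers don't follow.

def weight_bytes_per_query(total_weight_bytes, batch_size, active_fraction):
    # A batch of B queries is expected to touch 1 - (1 - f)^B of the weights;
    # a dense model is the f = 1 case (every query touches everything).
    touched = 1.0 - (1.0 - active_fraction) ** batch_size
    return total_weight_bytes * touched / batch_size

W = 140e9  # illustrative total weight footprint in bytes

for b in (1, 8, 64, 512):
    for name, f in (("dense", 1.0), ("moe f=0.1", 0.1)):
        per_query = weight_bytes_per_query(W, b, f)
        gain = weight_bytes_per_query(W, 1, f) / per_query
        print(f"{name:9s} batch {b:4d}: {per_query / 1e9:6.2f} GB/query, "
              f"{gain:6.1f}x amortization vs batch 1")
```

The dense model gets a B-fold cut in per-query weight traffic immediately, while the sparse one needs roughly a 1/f-times larger batch to see the same relative gain, which is exactly where the batch-latency and query-volume walls show up.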