Comment by imtringued
2 months ago
>It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?
Incorrect. Transformers usually contain a classical MLP layer, and it is only the MLP layer that can be batched. Hence all classical neural networks, including convolutional networks (via im2col), can be batched, too.
If anything, what the transformer architecture changes is that the attention layer cannot be batched, since each sequence carries its own keys and values.
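A minimal NumPy sketch of the distinction (shapes and names are made up): the MLP weights are shared by every sequence, so stacking the current tokens turns the whole batch into a single matmul, whereas at decode time each sequence drags along its own KV cache length, so the attention scores stay ragged.

```python
import numpy as np

d_model, batch = 1024, 8
W = np.random.randn(d_model, 4 * d_model)    # one MLP weight matrix, shared by all sequences

# MLP: stacking the 8 current tokens turns 8 matrix-vector products
# into a single matrix-matrix product -- same weights, one pass.
x_batch = np.random.randn(batch, d_model)
h_batch = x_batch @ W                        # (8, 4096): one GEMM for the whole batch

# Attention at decode time: every sequence has its own K cache with its own
# length, so the per-sequence score computation can't be fused into one GEMM.
kv_caches = [np.random.randn(np.random.randint(10, 200), d_model) for _ in range(batch)]
q_batch = np.random.randn(batch, d_model)
scores = [q @ K.T for q, K in zip(q_batch, kv_caches)]   # ragged: one matvec per sequence
```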
Yeah, this part was confusing, because it's only mentioned halfway through the article that the attention step can only be batched across matching context-window sizes.
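To make that concrete, a small follow-up sketch (again NumPy, hypothetical shapes): once the context lengths do match, say after padding every cache to a common length and masking the padded positions, the per-sequence score computations collapse into a single batched matmul.

```python
import numpy as np

batch, d_model, ctx = 8, 1024, 200
K_padded = np.random.randn(batch, ctx, d_model)       # all caches padded to the same length
q_batch = np.random.randn(batch, d_model)
scores = np.einsum('bd,bcd->bc', q_batch, K_padded)   # (8, 200): one fused batched op
```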