Comment by cornholio

6 days ago

The context after the algorithm is applied is just text: something like 256k input tokens, each token representing a group of roughly 2-5 characters and encoded in 18-20 bits.
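
If it helps to see the arithmetic, here's a rough sketch; the 256k context length and 18 bits per token are just the ballpark figures above (i.e. a vocabulary of ~2^18 entries), not any specific model's numbers.

```python
# Back-of-envelope: size of the context stored as plain token IDs.
# Assumed for illustration: 256k tokens, ~18 bits per token ID.
num_tokens = 256_000
bits_per_token = 18

context_bytes = num_tokens * bits_per_token / 8
print(f"Context as token IDs: ~{context_bytes / 1024:.0f} KiB")
# -> roughly half a MiB of plain data before it ever reaches the GPU
```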

The active context during inference, inside the GPUs, explodes each token into a 12,288-dimensional vector, so roughly 4 orders of magnitude more VRAM, and it is combined with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
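
Same back-of-envelope once each token is expanded on the GPU; the 12,288-dim hidden size and fp16 storage are assumptions for illustration, not a claim about any particular model.

```python
# Rough comparison: token ID on disk vs. hidden vector inside the GPU.
# Assumed for illustration: 12,288-dim vectors in fp16 (2 bytes/element),
# same 256k-token context as above.
num_tokens = 256_000
hidden_dim = 12_288
bytes_per_elem = 2                            # fp16

per_token_id = 18 / 8                         # ~2.25 bytes as a token ID
per_token_vec = hidden_dim * bytes_per_elem   # 24,576 bytes as a vector

print(f"Expansion per token: ~{per_token_vec / per_token_id:,.0f}x")
print(f"Hidden states for the full context: "
      f"~{num_tokens * per_token_vec / 2**30:.1f} GiB")
# ~10,900x per token (about 4 orders of magnitude), and several GiB for
# the context's hidden states alone, before counting the per-head KV
# cache or the model weights themselves.
```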