Comment by andai
12 hours ago
I think ChatGPT has a similar feature. I was amazed that the reply starts coming in literally the moment I press enter. As far as I can tell, that's only possible if all the previous tokens I typed have already been processed. So when I actually submit the message, it only needs to update the inner state by one more token.
i.e. I think it's sending my message to the server continuously, and updating the GPU state with each token (chunk of text) that comes in.
Or maybe their setup is just that good and doesn't actually need any tricks or optimizations? Either way, that's very impressive.
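A toy sketch of what I'm imagining (all names hypothetical; this just simulates the timing argument, it's obviously not their actual serving stack):

```python
import time

class IncrementalPrefill:
    """Simulates a server that processes each token as the user types,
    so the final submit only has to pay for the last token (speculation,
    not ChatGPT's real implementation)."""

    def __init__(self, cost_per_token=0.01):
        self.cost_per_token = cost_per_token  # pretend GPU time per token
        self.state = []  # stands in for the model's KV cache

    def feed(self, token):
        # Update the cached state for one token as it streams in.
        time.sleep(self.cost_per_token)
        self.state.append(token)

    def submit(self, final_token):
        # Only the last token is left to process at submit time;
        # everything before it was prefilled while the user typed.
        self.feed(final_token)
        return len(self.state)

if __name__ == "__main__":
    server = IncrementalPrefill(cost_per_token=0.01)
    for tok in "how do I reverse a list in".split():
        server.feed(tok)  # happens in the background while typing

    start = time.perf_counter()
    server.submit("python")  # pressing enter
    print(f"submit latency: {time.perf_counter() - start:.3f}s")
```

The submit latency is roughly one token's worth of compute instead of the whole prompt's, which would explain why the reply can start instantly.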
The 'flash' / no- or low-thinking versions of those models are crazy fast. We often receive the full response (not just the first token) in less than a second via the API.