Comment by andai
14 hours ago
I think ChatGPT has a similar feature. I was amazed that the reply starts coming in literally the moment I press enter. As far as I can tell, that's only possible if all the previous tokens I typed have already been processed. So when I actually submit the message, the server only needs to update its internal state by one more token.
i.e. I think it's sending my message to the server continuously, and updating the GPU state with each token (chunk of text) that comes in.
Or maybe their setup is just that good and doesn't actually need any tricks or optimizations? Either way, it's very impressive.
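The incremental update described above can be sketched in a toy way: the server keeps per-session state and folds in each token as it arrives, so pressing enter only costs one final update instead of reprocessing the whole prompt. This is purely illustrative (the class and hash-based "state" are made up); a real LLM server would be appending to a KV cache with one forward pass per new token.

```python
class IncrementalSession:
    """Toy stand-in for a server-side session holding model state."""

    def __init__(self):
        self.state = 0  # placeholder for the model's KV cache

    def feed(self, token: str) -> None:
        # Fold one token into the running state (placeholder for one
        # forward pass over the new token using cached keys/values).
        for ch in token:
            self.state = (self.state * 31 + ord(ch)) % (2**61 - 1)

    def submit(self, last_token: str) -> int:
        # At submit time only the final token still needs processing.
        self.feed(last_token)
        return self.state


def batch_process(tokens) -> int:
    # The naive alternative: do all the work at submit time.
    s = IncrementalSession()
    for t in tokens:
        s.feed(t)
    return s.state


tokens = ["Hello", " world", ", how", " are", " you", "?"]

session = IncrementalSession()
for t in tokens[:-1]:  # streamed to the server while the user types
    session.feed(t)
result = session.submit(tokens[-1])  # one token of work at enter

# Same final state either way; the incremental path just moves
# almost all of the latency to before the user presses enter.
assert result == batch_process(tokens)
```

The point of the sketch is only the shape of the trade-off: the total work is identical, but the incremental version has already done all but one token's worth of it by the time the message is submitted.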
> I think it's sending my message to the server continuously
It is, at least for the first message when starting a new chat. If you open the browser's network tools and type, you can see the text being sent to the servers on every keystroke.
Source: spending too much time analysing the network calls in ChatGPT in order to keep using mini models on a free account.
The 'flash' / no- or low-thinking versions of those models are crazy fast. We often receive the full response (not just the first token) in under 1 second via the API.