Comment by andai
14 hours ago
I think ChatGPT has a similar feature. I was amazed that the reply starts coming in literally the moment I press enter. As far as I can tell, that's only possible if all the previous tokens I typed have already been processed. So when I actually submit the message, the server only needs to update its internal state by one more token.
i.e. I think it's sending my message to the server continuously, and updating the GPU state with each token (chunk of text) that comes in.
Or maybe their setup is just that good and doesn't actually need any tricks or optimizations? Either way, it's very impressive.
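The incremental update described above can be sketched in a toy way: the server keeps per-session state and folds in each token as it arrives, so pressing enter only costs one final update instead of reprocessing the whole prompt. This is purely illustrative (the class and hash-based "state" are made up); a real LLM server would be appending to a KV cache with one forward pass per new token.

```python
class IncrementalSession:
    """Toy stand-in for a server-side session holding model state."""

    def __init__(self):
        self.state = 0  # placeholder for the model's KV cache

    def feed(self, token: str) -> None:
        # Fold one token into the running state (placeholder for one
        # forward pass over the new token using cached keys/values).
        for ch in token:
            self.state = (self.state * 31 + ord(ch)) % (2**61 - 1)

    def submit(self, last_token: str) -> int:
        # At submit time only the final token still needs processing.
        self.feed(last_token)
        return self.state


def batch_process(tokens) -> int:
    # The naive alternative: do all the work at submit time.
    s = IncrementalSession()
    for t in tokens:
        s.feed(t)
    return s.state


tokens = ["Hello", " world", ", how", " are", " you", "?"]

session = IncrementalSession()
for t in tokens[:-1]:  # streamed to the server while the user types
    session.feed(t)
result = session.submit(tokens[-1])  # one token of work at enter

# Same final state either way; the incremental path just moves
# almost all of the latency to before the user presses enter.
assert result == batch_process(tokens)
```

The point of the sketch is only the shape of the trade-off: the total work is identical, but the incremental version has already done all but one token's worth of it by the time the message is submitted.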
> I think it's sending my message to the server continuously
It is, at least for the first message when starting a new chat. If you open the browser's network tools and type, you can see the text being sent to the servers on every keystroke.
Source: spending too much time analysing the network calls in ChatGPT in order to keep using mini models on a free account.
The 'flash' / no- or low-thinking versions of those models are crazy fast. We often receive the full response (not just the first token) in under 1 second via the API.