Comment by recursivecaveat

1 day ago

If you look at a model with a diverse, competitive provider set like Llama 3, the latency is about a quarter of a second, and it will almost certainly keep improving, at least incrementally, if quality is held constant: https://artificialanalysis.ai/models/llama-3-3-instruct-70b/... Remember that as long as you experience the response linearly (very much the case for audio output, for example), the latency you actually feel is the time to the first chunk, not the time to stream the entire response.
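
A minimal sketch of the distinction, assuming an OpenAI-compatible streaming chat endpoint (the base_url, api_key, and model name below are placeholders, not any specific provider's values): time-to-first-chunk is what the user perceives, while total stream time only matters if you need the whole response before acting on it.

```python
import time
from openai import OpenAI  # pip install openai

# Placeholder endpoint: any OpenAI-compatible Llama 3.3 70B
# provider works the same way.
client = OpenAI(base_url="https://example-provider/v1", api_key="...")

start = time.monotonic()
stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)

first_chunk_at = None
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_chunk_at is None:
        # Perceived latency: time until the first token arrives.
        first_chunk_at = time.monotonic() - start
total = time.monotonic() - start

print(f"time to first chunk:  {first_chunk_at:.2f}s")  # what the user feels
print(f"time to full response: {total:.2f}s")  # irrelevant if consumed linearly
```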