Comment by com2kid

2 years ago

I wish more places showed Time To First Token. For scenarios involving real-time human interaction, the important parts are how long it takes until the first token is returned, and whether tokens are generated faster than people consume them.

Sadly, very few benchmarks bother to track this.
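
For what it's worth, TTFT is easy to measure yourself against any streaming endpoint. A minimal sketch in Python (standard library only; 'stream_tokens' stands in for a hypothetical generator that yields tokens from a streaming LLM API as they arrive):

    import time

    def measure_ttft(stream_tokens):
        """Time one streamed response; returns (ttft_seconds, tokens_per_second)."""
        start = time.perf_counter()
        ttft = None
        count = 0
        for _token in stream_tokens():
            if ttft is None:
                # Time until the very first token arrives.
                ttft = time.perf_counter() - start
            count += 1
        total = time.perf_counter() - start
        # Crude overall generation rate, enough to check that tokens arrive
        # faster than a reader consumes them (a few words per second).
        rate = count / total if total > 0 else 0.0
        return ttft, rate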

Hi, we do have this: take a look at the models page (https://artificialanalysis.ai/models) and scroll down to 'Latency'. It is also on the API host comparison pages for each model (e.g. https://artificialanalysis.ai/models/llama-2-chat-70b).

  • Ah so you do!

    Your latency numbers for OpenAI (and Azure's equivalents) seem really high; I run time-to-first-token tests and see much better numbers!

    (Also, are those numbers average, p50, p99, etc.? Those can differ a lot when latency has a long tail, as sketched at the end of this thread; I'd honestly expect a box plot to really see what is going on!)

    • Hey com2kid, if you're still around: we did end up adding box plots to show variance. They can be seen on the models page https://artificialanalysis.ai/models and on each model's page, where you can view hosts by clicking one of the models. They are toward the end of the page, under 'Detailed performance metrics'.
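
To make the average-vs-percentile question above concrete, here is a small sketch (Python standard library; the TTFT samples are made up for illustration) showing how a couple of slow responses pull the mean and p99 well away from the median, which is exactly the spread a box plot makes visible:

    import statistics

    # Hypothetical TTFT samples in seconds from repeated runs against one host.
    # Two slow outliers are enough to separate the summary statistics.
    ttft = [0.42, 0.45, 0.47, 0.48, 0.51, 0.52, 0.55, 0.60, 1.90, 2.30]

    mean = statistics.mean(ttft)                  # ~0.82s, dragged up by outliers
    p50 = statistics.median(ttft)                 # ~0.52s, the typical experience
    p99 = statistics.quantiles(ttft, n=100)[98]   # ~2.26s, the tail
    print(f"mean={mean:.2f}s  p50={p50:.2f}s  p99={p99:.2f}s")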