Comment by fulafel

9 hours ago

So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

3 comments

fulafel

dotancohen 7 hours ago

For many workflows involving real time human interaction, such as voice assistant, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response quality threshold has been achieved, as is the software planning and writing tasks that most HN readers are likely familiar.

raw_anon_1111 3 hours ago

The way that voice assistants work even in the age of LLMs are:
Voice —> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.
TTFT is irrelevant, you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model
Source: I do this kind of stuff for call centers. Yes I know modern LLMs don’t go through the voice -> text -> LLM -> text -> voice anymore. But that only works when you don’t have to call external sources

Art9681 2 hours ago

It's going to be faster no matter what. My M3 MAX prints tokens faster than I can read for the new MoE models. It's the prompt processing that kills it when the context grows beyond a threshold which is easy to do in the modern agentic loops.