A bit strange to use time to first token instead of throughput.
Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
As far as benchmarketing goes they clearly went with prefill because it's much easier for apple to improve prefill numbers (flops-dominated) than decode (bandwidth-dominated, at least for local inference); M5 unified memory bandwidth is only about 10% better than the M4.
Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.
No you don't. Not as a sticky mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, that doesn't actually remotely do a damned thing, but it gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.
A bit strange to use time to first token instead of throughput.
Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
As far as benchmarketing goes they clearly went with prefill because it's much easier for apple to improve prefill numbers (flops-dominated) than decode (bandwidth-dominated, at least for local inference); M5 unified memory bandwidth is only about 10% better than the M4.
In previous generations, throughout was excellent for an integrated GPU, but the time to first token was lacking.
So throughput was already good but TTFT was the metric that needed more improvement?
3 replies →
Not strange, for the kind of applications models at that size are often used for the prefill is the main factor in responsiveness. Large prompt, small completion.
I assume it’s time to first output token so it’s basically throughput. How fast can it output 8001 tokens
No you don't. Not as a sticky mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, that doesn't actually remotely do a damned thing, but it gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.
Some kinds of spinners serve as a coal-mine canary indicating if the app has gotten wedged. Not hugely useful, but also not entirely useless.
I would consider it reasonable if this was 4x TTFT and Throughput, but it seems like it's only for TTFT.