Comment by criemen

2 days ago

Thanks a lot for putting this together!

I have a question. In https://github.com/bentoml/llm-inference-in-production/blob/..., a single picture is used to define both TTFT and ITL. That doesn't match my understanding (though you probably know more than I do): in the graphic, it looks like the model generates four tokens, T0 to T3, before emitting a single output token.

I'd have expected that picture for ITL (except that the labeling of the last box would then be off). For TTFT, though, I'd have expected only a single token T0 from the decode step, which is immediately handed to detokenization and arrives as the first output token (assuming a streaming setup; otherwise measuring TTFT makes little sense).
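
To make my understanding concrete, here is a minimal sketch of how I'd compute the two metrics from client-side timestamps in a streaming setup. The function name `ttft_and_itl` and the timestamp values are made up for illustration; this is not taken from the handbook or from any particular benchmarking tool.

```python
from typing import List, Tuple


def ttft_and_itl(request_sent: float, token_arrivals: List[float]) -> Tuple[float, List[float]]:
    """Compute TTFT and per-token ITL from client-side timestamps (in seconds).

    request_sent   -- time the request left the client (hypothetical input)
    token_arrivals -- arrival time of each streamed output token, in order
    """
    if not token_arrivals:
        raise ValueError("need at least one output token to measure TTFT")

    # TTFT: delay until the FIRST streamed token (T0) arrives.
    # It covers queueing, prefill, one decode step, and detokenization of T0.
    ttft = token_arrivals[0] - request_sent

    # ITL: gap between each pair of consecutive streamed tokens
    # (T0 -> T1, T1 -> T2, ...), so N tokens give N - 1 gaps.
    itl = [later - earlier for earlier, later in zip(token_arrivals, token_arrivals[1:])]

    return ttft, itl


# Example: request sent at t=0.0, tokens T0..T3 arrive at these times.
ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft)  # 0.5
print(itl)   # [0.25, 0.25, 0.25]
```

Under this reading, TTFT depends only on when T0 arrives, while T1 to T3 matter only for ITL, which is why I'd expect two separate pictures.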