Comment by embedding-shape

2 months ago

Whenever reasoning/thinking is involved, 20 t/s is way too slow for most non-async tasks, yeah.

Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, that's 350 tokens at 20 t/s, so you're sitting and waiting 17.5 seconds to process one item. In practice, you're also forgetting about prefill/prompt processing, tokenization and such, which come on top of that. Please do share all relevant numbers :)