Comment by embedding-shape
2 days ago
Whenever reasoning/thinking is involved, 20t/s is way too slow for most non-async tasks, yeah.
Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, you're sitting and waiting 17.5 seconds to process one item. And that's generation alone: in practice, you're also forgetting about prefill (prompt processing), tokenization and such, which come on top. Please do share all relevant numbers :)
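For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic. The function name and the `overhead_seconds` parameter are made up for illustration. The overhead stands in for prefill, tokenization and similar costs, and defaults to zero because no such figure was given here.

```python
def seconds_per_item(reasoning_tokens: int,
                     reply_tokens: int,
                     tokens_per_second: float,
                     overhead_seconds: float = 0.0) -> float:
    """Estimate wall-clock time to process one item.

    Decode time is total generated tokens divided by generation speed.
    overhead_seconds is a hypothetical placeholder for prefill,
    tokenization, etc.; measure it on your own setup.
    """
    generated = reasoning_tokens + reply_tokens
    return generated / tokens_per_second + overhead_seconds


# The example from the comment: 300 reasoning + 50 reply tokens at 20 t/s.
print(seconds_per_item(300, 50, 20))  # 17.5
```

Any non-zero overhead only pushes the per-item time up, so 17.5 s is the floor for this example.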