Comment by rz2k
2 days ago
In practice the 4-bit MLX version runs at 20 t/s for general chat. Do you consider that too slow for practical use?
What example tasks would you try?
2 days ago
Whenever reasoning/thinking is involved, 20 t/s is way too slow for most non-async tasks, yeah.
Translation, classification, whatever. If the response is 300 tokens of reasoning plus 50 tokens for the final reply, you're sitting and waiting 17.5 seconds (350 tokens at 20 t/s) to process one item. In practice, you're also forgetting about tokenization and prefill (prompt processing), which come on top of that. Please do share all relevant numbers :)
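For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope sketch of that arithmetic. The function names are made up for illustration, and the prompt size and prefill speed in the second call are hypothetical placeholders, not measurements from any MLX run.

```python
def decode_seconds(reasoning_tokens: int, reply_tokens: int, decode_tps: float) -> float:
    """Time spent generating output: every generated token costs 1/decode_tps seconds."""
    return (reasoning_tokens + reply_tokens) / decode_tps


def total_seconds(prompt_tokens: int, prefill_tps: float,
                  reasoning_tokens: int, reply_tokens: int, decode_tps: float) -> float:
    """Prefill (prompt processing) time plus generation time for one item."""
    prefill = prompt_tokens / prefill_tps
    return prefill + decode_seconds(reasoning_tokens, reply_tokens, decode_tps)


# Numbers from the comment: 300 reasoning tokens + 50 reply tokens at 20 t/s.
print(decode_seconds(300, 50, 20.0))  # 17.5

# Hypothetical prompt: 1000 tokens prefilled at 500 t/s (assumed values only).
print(total_seconds(1000, 500.0, 300, 50, 20.0))  # 19.5
```

Swap in measured prefill and decode speeds to see how long one item really takes end to end.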