Comment by anon373839
6 hours ago
I have seen ~1,300 tokens/sec of total throughput with Llama 3 8B on a MacBook Pro. So no, you don't halve the performance. But running batched inference takes more memory, so you have to use shorter contexts than if you weren't batching.
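
To see why batching forces shorter contexts, here's a rough back-of-the-envelope sketch of KV-cache sizing. It assumes Llama 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dim 128) and fp16 cache entries; the 8 GiB memory budget is an illustrative number, not a measurement from the setup above.

```python
# Back-of-the-envelope KV-cache sizing for batched inference.
# Architecture values are Llama 3 8B's published config; the memory
# budget below is a hypothetical figure for illustration.

N_LAYERS = 32
N_KV_HEADS = 8       # grouped-query attention: 8 KV heads, not 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16

# K and V tensors, per token, summed across all layers.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

budget_bytes = 8 * 1024**3  # hypothetical 8 GiB left after model weights

for batch_size in (1, 8, 32):
    max_ctx = budget_bytes // (batch_size * kv_bytes_per_token)
    print(f"batch={batch_size:>2}: max context per sequence ≈ {max_ctx:,} tokens")
```

The cache grows linearly with batch size, so under a fixed budget a batch of 32 gets roughly 1/32 the per-sequence context of a single stream (here ~2,048 vs. ~65,536 tokens), which is the tradeoff described above.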