nubg 11 days ago: Are parallel requests "free"? Or do you halve performance when sending two requests in parallel?
anon373839 10 days ago: I have seen ~1,300 tokens/sec of total throughput with Llama 3 8B on a MacBook Pro. So no, you don't halve the performance. But running batched inference takes more memory, so you have to use shorter contexts than if you weren't batching.
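To make the batching claim concrete, here is a minimal sketch of firing two requests in parallel at a local OpenAI-compatible server and measuring combined throughput. The URL, model name, and response fields are assumptions for illustration, not details from the thread; a server that batches concurrent requests decodes them together, so total tokens/sec exceeds a single stream.

```python
import asyncio
import time

import httpx

# Hypothetical local OpenAI-compatible server serving Llama 3 8B;
# the URL and model name are assumptions, not from the thread.
BASE_URL = "http://localhost:8080/v1/completions"
MODEL = "llama-3-8b"


async def complete(client: httpx.AsyncClient, prompt: str) -> int:
    """Send one completion request; return the number of tokens generated."""
    resp = await client.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128},
        timeout=120.0,
    )
    resp.raise_for_status()
    # OpenAI-style responses report generated-token counts under "usage".
    return resp.json()["usage"]["completion_tokens"]


async def main() -> None:
    prompts = ["Explain batched inference.", "Summarize KV caching."]
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        # Issue both requests concurrently; a batching server processes
        # them in one decode loop rather than serially.
        counts = await asyncio.gather(*(complete(client, p) for p in prompts))
        elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} tokens in {elapsed:.2f}s ({total / elapsed:.0f} tok/s total)")


asyncio.run(main())
```

Comparing the total tok/s here against a single sequential request shows whether the server actually batches: if two parallel requests finish in roughly the time of one, parallelism is close to "free" in compute, at the cost of the extra KV-cache memory the reply mentions.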