I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.
Edit: I just tried a 4B model on an RTX Pro 6000 and got ~500 tok/s with llama.cpp on default settings, without trying to optimize or change anything. I'm sure vLLM would already be a lot faster, even before any manual config tuning. I wouldn't call that card a "Standard GPU" either FWIW, but it makes the claimed performance numbers feel less exciting, especially given the hardware they were using.
As in, not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?
I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.
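Quick back-of-envelope using only the numbers in this thread. This assumes the ~3k tok/s figure is aggregate throughput spread evenly across all eight H200s; if it's actually a single-stream number, the naive division below doesn't apply. The models also differ (2B vs 4B), so treat it as a rough sanity check, not a benchmark.

```python
# Rough per-GPU comparison using figures quoted in this thread.
# Assumption: the ~3000 tok/s claim is aggregate throughput across all 8 H200s,
# split evenly. Real multi-GPU scaling is rarely linear, and the models differ
# in size (2B vs 4B), so this is only a sanity check.

def per_gpu_throughput(total_tok_s: float, num_gpus: int) -> float:
    """Naive per-GPU share of an aggregate tokens/second figure."""
    return total_tok_s / num_gpus

claimed = per_gpu_throughput(3000, 8)   # 8x H200, 2B model (claimed)
observed = per_gpu_throughput(500, 1)   # 1x RTX Pro 6000, 4B model, llama.cpp defaults

print(f"claimed:  {claimed:.0f} tok/s per GPU")
print(f"observed: {observed:.0f} tok/s per GPU")
```

Under that assumption the claimed setup works out to roughly 375 tok/s per GPU, below the ~500 tok/s seen untuned on a single RTX Pro 6000 with a larger model.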
> Did you expect a single GPU chip to reach 3k tps?
Did the article headline not say Standard GPU?
So what would the above-standard GPUs they're excluding be, then? Cerebras isn't a GPU.
Everyone's beholden to a data center, or subject to the installation on the corner of your property, of course. Keep up with the times... /s