Comment by SparkyMcUnicorn
16 hours ago
Here are some llama.cpp benchmarks for it: https://www.phoronix.com/review/intel-arc-pro-b70-linux/3
Just ran llama-bench at home with the similarly priced AMD AI PRO R9700 32G. The Phoronix numbers look extremely low? Maybe I'm misunderstanding their test bench. Anyway, here are some numbers. Maybe someone with access to a B70 can post a comparison.
Tried to use the same model as the article:
llama-bench -m gpt-oss-20b-Q8_0.gguf -ngl 999 -p 2048 -n 128
AMD R9700: pp2048 = 3867 t/s, tg128 = 175 t/s
And a bigger model, because testing a tiny model with a 32GB card feels like a waste:
llama-bench -m Qwen3.6-27B-UD-Q6_K_XL.gguf -ngl 999 -p 2048 -n 128
AMD R9700: pp2048 = 917 t/s, tg128 = 22 t/s
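If anyone with a B70 wants to post a direct comparison, a loop like this should reproduce both runs in one go (a minimal sketch; -r and -o md are standard llama-bench flags, and the filenames are just the ones I used):

for m in gpt-oss-20b-Q8_0.gguf Qwen3.6-27B-UD-Q6_K_XL.gguf; do
  llama-bench -m "$m" -ngl 999 -p 2048 -n 128 -r 5 -o md
done

The -o md output pastes cleanly into a comment.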
As of b8966, it is still not great.
Edit: I've no idea why one would use gpt-oss-20b at Q8, but the result is basically the same.
Hopefully, support for the B70 will continue to improve. In retrospect, I probably should have bought an R9700 instead...
"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?
In that particular model family, the choices are 20B and 120B, so a higher quant of 20B fits in VRAM, whereas with 120B you'd be settling for a lower quant. Is it that 20B at MXFP4 performs comparably, so there's no need for Q8?
Or is the insight simply that there are better models available now, and the emphasis is on gpt-oss-20b rather than on Q8?
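If it's the quality question, one way to answer it empirically is to run llama-perplexity on both quants and compare (a minimal sketch; the MXFP4 filename is illustrative, and you need a test corpus such as wikitext's wiki.test.raw):

llama-perplexity -m gpt-oss-20b-Q8_0.gguf -f wiki.test.raw -ngl 999
llama-perplexity -m gpt-oss-20b-mxfp4.gguf -f wiki.test.raw -ngl 999

If the perplexities come out nearly identical, Q8 buys you nothing over the native MXFP4 except a bigger file.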
At this speed, people end up paying more for electricity than for API calls (at California electricity rates).
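Back-of-envelope, taking the 22 t/s from above and assuming roughly 300 W board power at $0.30/kWh (both assumptions, not measurements):

awk 'BEGIN { tps=22; watts=300; usd_kwh=0.30;
  h = 1e6/tps/3600;                                  # hours per 1M generated tokens
  printf "%.1f h, ~$%.2f electricity per 1M tokens\n", h, h*watts/1000*usd_kwh }'

That lands around $1.14 per million tokens, which is indeed more than many providers charge for small open-weight models.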
For reference, in case it's interesting to someone: a 5090 on Windows 11 with CUDA 13.1.
Using the MXFP4 quant of GPT-OSS because it was trained quantization-aware for that format, and it's native to the 50xx series.
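For anyone comparing against the R9700 numbers above, the equivalent run would be something like this (the filename is a guess; use whichever MXFP4 GGUF you downloaded):

llama-bench -m gpt-oss-20b-mxfp4.gguf -ngl 999 -p 2048 -n 128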
You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 with AutoRound when MTP is enabled. That's faster than Sonnet API calls.
A 5090 gets maybe 100 TPS with MTP.
The build they use is from February, over two months old: https://github.com/ggml-org/llama.cpp/releases/tag/b8121
That might not sound like much, but two months is a long time in LLM terms, especially when it comes to support for new hardware like the R9700.
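If you want to check a tag's age yourself, the GitHub releases API has the publish date (assumes curl and jq are installed):

curl -s https://api.github.com/repos/ggml-org/llama.cpp/releases/tags/b8121 | jq -r .published_at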
Also from Phoronix, a comparison of the AMD R9700 against the RTX 6000 Ada (because Nvidia has not sent them a Blackwell card): https://www.phoronix.com/review/intel-arc-pro-b70/2