Comment by pkroll
1 day ago
Since no one else posted it... I have Open WebUI pointed at a Linux box with 128 GB of RAM and an RTX Pro 6000. After a couple of trivia runs, I had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."
72.06 t/s. That's the full Qwen 3.6 27B model in BF16, using MTP, running on Ollama. Yes, I know I should bite the bullet and get vLLM running on that box.
That was also at a 570 watt power limit. I normally run a little less, but when I first tried this I had actually forgotten that I'd set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute-bound things: the difference when generating LTX2.3 videos is considerably higher... but still not linear.)
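For a rough sense of what those two data points imply, here is a small Python sketch that compares them. It uses only the figures quoted above. The one assumption (not something measured here) is that the card draws its full power limit during generation, so the tokens-per-joule values are lower bounds rather than measurements.

```python
# Figures quoted in the comment: (power limit in watts, tokens per second).
runs = {"300 W limit": (300, 69.38), "570 W limit": (570, 72.06)}


def tokens_per_joule(limit_watts, tokens_per_sec):
    # Assumption: the GPU draws its full power limit while generating.
    # Real draw may be lower, so treat the result as a lower bound.
    return tokens_per_sec / limit_watts


low_w, low_tps = runs["300 W limit"]
high_w, high_tps = runs["570 W limit"]

# Relative change in throughput versus relative change in the power limit.
speedup_pct = (high_tps / low_tps - 1) * 100
power_pct = (high_w / low_w - 1) * 100

print(f"throughput gain: {speedup_pct:.1f}%")
print(f"power limit increase: {power_pct:.1f}%")
for name, (watts, tps) in runs.items():
    print(f"{name}: {tokens_per_joule(watts, tps):.3f} tokens/J (lower bound)")
```

On these numbers, raising the limit by 90% buys a bit under 4% more tokens per second, which fits the point that token generation here is not mainly limited by compute.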