Comment by mtone
5 hours ago
Here's a DeepSeek-V4-Flash benchmark on 2× RTX Pro 6000:
- Prefill: ~10K tok/s
- Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
- GPU power draw during the benchmark: average 585 W | max 849 W | limit 1200 W (undervolted). Idle PC draw is 125 W.
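A quick way to read the decode numbers is to split aggregate throughput into per-request speed and into monthly volume at full load. This is a minimal sketch that uses only the benchmark figures above and assumes a 30-day month of continuous decoding. The function names are illustrative.

```python
# Aggregate decode throughput (tok/s) from the benchmark, keyed by concurrency.
DECODE_TOK_S = {1: 190, 4: 375, 16: 980}

# Assumption: a 30-day month of continuous operation.
SECONDS_PER_MONTH = 30 * 24 * 3600


def per_request_tok_s(concurrency: int) -> float:
    """Average decode speed a single request sees at this concurrency."""
    return DECODE_TOK_S[concurrency] / concurrency


def monthly_output_tokens(concurrency: int) -> int:
    """Output tokens per month if decode runs at this concurrency 24/7."""
    return DECODE_TOK_S[concurrency] * SECONDS_PER_MONTH


if __name__ == "__main__":
    for c in sorted(DECODE_TOK_S):
        print(f"c={c:>2}: {per_request_tok_s(c):.1f} tok/s per request, "
              f"{monthly_output_tokens(c) / 1e9:.2f}B output tokens/month")
```

At c=4 this works out to roughly 0.97B output tokens/month, in line with the 1B OUT/month estimate used for the breakeven pricing. Per-request speed drops from 190 to about 94 tok/s as the batch grows.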
I asked the model to calculate the following, assuming a realistic blend of cached prompts and decode for an agentic dev scenario.
Electricity-only costs (at USD $0.08/kWh):
Usage | IN price | OUT price | Monthly cost
Concurrency=1 | $0.040/M | $0.080/M | $8.65 to $38.88 (5% to 100% active)
Concurrency=4 | $0.024/M | $0.044/M | up to $48.67 (cheaper per token but higher power draw)
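The OUT prices in this table are essentially power draw divided by decode throughput. Below is a rough sketch of that arithmetic. It assumes the wall draw is the 585 W GPU average plus the 125 W idle system (710 W total), which is an approximation, since the 585 W average covers the whole benchmark rather than c=1 alone. The IN price can't be derived this way because it depends on the cached-prompt blend, which isn't given.

```python
def electricity_usd_per_m_tokens(watts: float, tok_s: float,
                                 usd_per_kwh: float = 0.08) -> float:
    """Electricity cost (USD) per million tokens at a steady draw and throughput."""
    usd_per_hour = (watts / 1000.0) * usd_per_kwh  # kW * $/kWh = $/h
    tokens_per_hour = tok_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


def monthly_electricity_usd(avg_watts: float, usd_per_kwh: float = 0.08,
                            hours: float = 30 * 24) -> float:
    """Monthly bill for a given average wall draw (assumes a 720-hour month)."""
    return (avg_watts / 1000.0) * hours * usd_per_kwh


if __name__ == "__main__":
    # Assumption: wall draw ~ 585 W GPU average + 125 W idle system.
    wall_w = 585 + 125
    print(f"c=1 decode: ${electricity_usd_per_m_tokens(wall_w, 190):.3f}/M tokens")
    print(f"24/7 at {wall_w} W: ${monthly_electricity_usd(wall_w):.2f}/month")
```

Under that assumption, c=1 decode costs about $0.083/M, close to the table's $0.080/M. Running 24/7 comes to roughly $41/month, in the same range as the $38.88 figure. At c=4 and the same 710 W, the result is about $0.042/M. The table's $0.044/M presumably reflects a somewhat higher draw at that concurrency.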
Total cost of ownership over 3 years is electricity plus $20K USD (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?
A) Breakeven API pricing (est. 2B IN + 1B OUT tokens/month):
Option | IN price | OUT price
Self-hosted | $0.121/M | $0.363/M
OpenRouter (budget) | $0.098/M | $0.196/M
OpenRouter (DeepSeek) | $0.140/M | $0.280/M
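The self-hosted row can be reproduced by spreading the monthly TCO ($20K over 36 months plus the c=4 electricity of $48.67) across the 2B IN + 1B OUT token volume. This sketch assumes OUT is priced at 3× IN. That ratio is inferred from the row itself ($0.363 / $0.121) and is not stated in the comment.

```python
HARDWARE_USD = 20_000          # up-front cost quoted in the comment
MONTHS = 36                    # 3-year amortization
ELECTRICITY_USD_MONTH = 48.67  # c=4 full-load electricity from the table
IN_TOKENS_M = 2_000            # 2B input tokens/month, in millions
OUT_TOKENS_M = 1_000           # 1B output tokens/month, in millions
OUT_TO_IN_RATIO = 3.0          # assumption: OUT priced at 3x IN


def monthly_tco() -> float:
    """Amortized hardware plus electricity, in USD per month."""
    return HARDWARE_USD / MONTHS + ELECTRICITY_USD_MONTH


def breakeven_prices() -> tuple[float, float]:
    """Per-million IN and OUT prices that exactly cover the monthly TCO."""
    # Solve in_price * IN + (ratio * in_price) * OUT = TCO for in_price.
    in_price = monthly_tco() / (IN_TOKENS_M + OUT_TO_IN_RATIO * OUT_TOKENS_M)
    return in_price, OUT_TO_IN_RATIO * in_price


if __name__ == "__main__":
    in_p, out_p = breakeven_prices()
    print(f"TCO ${monthly_tco():.2f}/month -> IN ${in_p:.3f}/M, OUT ${out_p:.3f}/M")
```

Monthly TCO comes to about $604, which gives $0.121/M IN and $0.363/M OUT after rounding, matching the self-hosted row. Changing `OUT_TO_IN_RATIO` shifts cost between input and output without changing the total.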
B) Breakeven subscription (users active ~1.5h/day):
1 user: $563/mo (oh, hai)
25 users: $23/mo
100 users: $6/mo
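The subscription figures appear to be the amortized hardware cost split across users, plus electricity that rises with load. This is a rough consistency check under that assumption. It subtracts the hardware share ($20K / 36 months) from each quoted price and confirms that the remainder stays within the $48.67 full-load electricity figure. The names are illustrative.

```python
# Assumption: hardware cost amortized evenly over 36 months.
HARDWARE_USD_MONTH = 20_000 / 36  # about $555.56/month

# Subscription prices quoted in the comment (users -> USD per user per month).
QUOTED = {1: 563, 25: 23, 100: 6}

# Full-load electricity at c=4 from the electricity table.
MAX_ELECTRICITY_USD_MONTH = 48.67


def hardware_share_per_user(users: int) -> float:
    """Each user's slice of the amortized hardware cost."""
    return HARDWARE_USD_MONTH / users


def implied_electricity(users: int, price_per_user: float) -> float:
    """What the quoted subscription revenue covers beyond hardware amortization."""
    return users * price_per_user - HARDWARE_USD_MONTH


if __name__ == "__main__":
    for users, price in QUOTED.items():
        print(f"{users:>3} users: hardware ${hardware_share_per_user(users):.2f}/user, "
              f"implied electricity ${implied_electricity(users, price):.2f}/month")
```

The leftover grows from about $7 to about $44 per month as the user count rises. That stays under the $48.67 full-load figure, so the quoted prices look consistent with hardware amortization plus load-dependent electricity.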
Vouched your comment. Very cool. What are you running to get 190 tok/s? I get 400 tok/s at c=4, but my c=1 is slower than yours.