Comment by mtone

5 hours ago

Here's a DeepSeek-V4-Flash benchmark on 2x RTX Pro 6000:

  - Prefill: ~10K tok/s
  - Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
  - GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.

I've asked it to calculate the following, assuming a realistic blend of cached prompts and decode for an agentic dev scenario.

Electricity-only (@ USD $0.08/kWh)

  Usage          | IN price  | OUT price | Monthly cost
  Concurrency=1  | $0.040/M  | $0.080/M  | $8.65 to $38.88 (5% to 100% active)
  Concurrency=4  | $0.024/M  | $0.044/M  | up to $48.67 (cheaper per token but higher power draw)
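For anyone who wants to sanity-check the electricity figures, here is a minimal Python sketch. It assumes a 30-day (720 h) month and uses whole-system draws of about 675 W and 845 W. Those two wattages are not measured values from the benchmark; they are back-calculated from the $38.88 and $48.67 monthly maxima, so treat them as assumptions.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def monthly_electricity_cost(avg_watts, usd_per_kwh=0.08, hours=HOURS_PER_MONTH):
    """USD cost of drawing avg_watts continuously for `hours`."""
    kwh = avg_watts / 1000 * hours
    return kwh * usd_per_kwh


def cost_per_million_tokens(avg_watts, tokens_per_second, usd_per_kwh=0.08):
    """USD electricity cost to produce one million tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_second / 3600
    return avg_watts / 1000 * hours * usd_per_kwh


# 675 W and 845 W are back-calculated assumptions, not benchmark readings.
print(f"c=1, 100% active: ${monthly_electricity_cost(675):.2f}/month")
print(f"c=4, 100% active: ${monthly_electricity_cost(845):.2f}/month")
print(f"c=1 decode at 190 tok/s: ${cost_per_million_tokens(675, 190):.3f}/M")
```

Under these assumptions the two monthly maxima come out at $38.88 and $48.67, and pure decode at c=1 lands near $0.079/M. The table's per-token prices use a blend of cached prompts and decode, so they will not all fall out of this simple formula.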

Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?

A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):

                        IN price    OUT price
  Self-hosted           $0.121/M    $0.363/M
  OpenRouter (budget)   $0.098/M    $0.196/M
  OpenRouter (DeepSeek) $0.140/M    $0.280/M
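The self-hosted row can be reproduced with a short calculation. This is a sketch, assuming the $20K is amortized linearly over 36 months, electricity sits at the $48.67/month maximum for concurrency=4, and OUT tokens are priced at 3x IN tokens. The 3x ratio is inferred from the table, not stated anywhere, and the function name is made up for illustration.

```python
def breakeven_prices(hardware_usd, months, electricity_usd_per_month,
                     in_mtok_per_month, out_mtok_per_month, out_to_in_ratio):
    """Return (IN, OUT) price per million tokens that covers monthly cost."""
    monthly_cost = hardware_usd / months + electricity_usd_per_month
    # Solve: in_price * IN + (ratio * in_price) * OUT = monthly_cost
    in_price = monthly_cost / (
        in_mtok_per_month + out_to_in_ratio * out_mtok_per_month
    )
    return in_price, out_to_in_ratio * in_price


# 2B IN + 1B OUT per month = 2,000 M and 1,000 M tokens.
# out_to_in_ratio=3.0 is an assumption inferred from the table.
in_price, out_price = breakeven_prices(20_000, 36, 48.67, 2_000, 1_000, 3.0)
print(f"IN ${in_price:.3f}/M, OUT ${out_price:.3f}/M")
```

With these inputs the monthly cost is about $604, which works out to $0.121/M IN and $0.363/M OUT. Lower sustained throughput than 2B + 1B tokens per month raises both prices proportionally.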

B) Breakeven subscription (users active ~1.5h/day):

    1 user: $563/mo (oh, hai)
    25 users: $23/mo
    100 users: $6/mo

arjie 1 hour ago

Vouched your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4, but my c=1 is slower than yours.