Comment by kingstnap
4 months ago
He has GLM 4.5 Running at ~100 Tokens per second.
Assumptions:
Batch 4x and get 400 tokens per second and push his power consumption to 900W instead of the underutilized 300W.
Electricity around €0.2/kWh.
Tokens valued at €1/1M out.
Assume ~70% utilization.
Result:
You get ~1M tokens per hour, which is a net profit of ~€0.8/hr. That works out to a payback time of a bit over a year given the €9K investment.
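Spelling the arithmetic out (a minimal sketch using only the assumptions above):

```python
# Back-of-envelope payback estimate; every number is an assumption from above.
tok_per_s = 400            # batched throughput assumption
utilization = 0.70         # fraction of time actually serving requests
power_kw = 0.9             # 900 W under full load
elec_eur_per_kwh = 0.20
token_price_eur_per_m = 1.0
capex_eur = 9000

tokens_per_hour = tok_per_s * 3600 * utilization            # ~1.0M
revenue_per_hour = tokens_per_hour / 1e6 * token_price_eur_per_m
electricity_per_hour = power_kw * elec_eur_per_kwh
net_per_hour = revenue_per_hour - electricity_per_hour      # ~€0.83/hr
payback_days = capex_eur / net_per_hour / 24                # ~450 days

print(f"net €/hr: {net_per_hour:.2f}, payback: {payback_days:.0f} days")
```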
Honestly though there is a lot of handwaving here. The most significant unknown is getting high utilization with aggressive batching and 24/7 load.
Also the demand for privacy can make the utility of the tokens much higher than typical API prices for open source models.
In a sort of orthogonal comparison, renting 2 H100s costs around $6 per hour, so measured against rental prices the payback time is a bit over a couple of months.
> He has GLM 4.5 Running at ~100 Tokens per second.
GLM 4.5 Air, to be precise. It's the smaller 106B model, not the full 355B one.
Worth mentioning when discussing token throughput.
I'm downloading DeepSeek-V3.2-Speciale now at FP8 (reportedly Gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics).
It will fit in system RAM, and since it's a mixture-of-experts model and the individual experts are not too large, I can at least run it. Tokens-per-second will be slower, but with system memory bandwidth somewhere around 500-600 GB/s it should feel OK.
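As a sanity check on "should feel OK": decode speed for a RAM-resident MoE model is roughly bandwidth-bound, since each generated token has to stream the active parameters from memory. A rough ceiling, assuming ~550 GB/s bandwidth and ~37B active parameters at FP8 (the active-parameter figure is my assumption for a DeepSeek-V3-class MoE, not from the comment):

```python
# Bandwidth-bound decode ceiling: tokens/s ~= memory bandwidth / bytes per token.
bandwidth_gb_s = 550       # assumed system memory bandwidth (500-600 GB/s range)
active_params = 37e9       # assumed active params per token (DeepSeek-V3-class MoE)
bytes_per_param = 1.0      # FP8

bytes_per_token = active_params * bytes_per_param
tok_per_s = bandwidth_gb_s * 1e9 / bytes_per_token
print(f"~{tok_per_s:.0f} tokens/s upper bound")
```

Real throughput will land below this, but low-double-digit tokens/s is consistent with "slower but usable".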
Check out "--n-cpu-moe" in llama.cpp if you're not familiar. It lets you keep the expert weights of a given number of layers in system RAM while everything else (including the context cache and the parts of the model that every token touches) stays in VRAM. You can do something like "-c 131072 -ngl 99 --n-cpu-moe <tuned_amt>", where you find a number that maximizes VRAM usage without OOMing.
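A full invocation might look like the following sketch (the model path and the --n-cpu-moe value are placeholders you'd tune for your own VRAM):

```shell
# Hypothetical llama-server launch: offload all layers to GPU (-ngl 99),
# then keep the expert weights of the first 30 layers in system RAM.
# Lower --n-cpu-moe until just before OOM to maximize VRAM utilization.
llama-server \
  -m ./model-Q4_K_M.gguf \
  -c 131072 \
  -ngl 99 \
  --n-cpu-moe 30
```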
This is about more than cost. I can run 600B+ models at home. Today my wife and I asked ChatGPT a quick question and it refused, because it wouldn't generate a result broken down by race. I tried to prompt around it and it absolutely refused. I used my local model, the latest Mistral-Large3-675B, and got the answer I was looking for. What's the cost of that?
about the cost of your hardware lol
The author was running a quantised version of GLM 4.5 _Air_, not the full-fat version. API pricing for that is closer to $0.2/M input and $1.1/M output tokens at the top end from z.ai themselves, and half that from Novita/SiliconFlow.