Comment by bastawhiz
5 days ago
This isn't a good analysis, and it's because it keeps rounding everything up. He rounds up the cost of electricity by 10%. He has a range of power use, takes the high end (which is 2x the low end) and multiplies it by the inflated electricity cost.
But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast but the author points out: you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.
Apple silicon works out pretty good if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things that I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on hardware that's multi tenant with cheap electricity will pretty much always win.
Rounding everything down in the most optimistic setting got me to $0.40 per million tokens, and openrouter has the same model at $.38/mtok.
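To make the rounding argument concrete, here is a minimal sketch of the electricity-only cost per million tokens. The wattages, electricity prices, and token rates below are hypothetical placeholders, not the article's figures; they only mirror the shape of the complaint (a 2x power range, a 10% price bump, and a 10-40 tokens per second range).

```python
def electricity_cost_per_mtok(watts: float, usd_per_kwh: float, tokens_per_second: float) -> float:
    """Electricity cost in USD to generate one million tokens."""
    usd_per_hour = (watts / 1000.0) * usd_per_kwh   # kW drawn times price per kWh
    tokens_per_hour = tokens_per_second * 3600.0
    return usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs (assumptions, not taken from the article):
# optimistic = low power draw, un-inflated price, fast generation
low = electricity_cost_per_mtok(watts=50, usd_per_kwh=0.15, tokens_per_second=40)
# pessimistic = 2x the power, price rounded up 10%, slow generation
high = electricity_cost_per_mtok(watts=100, usd_per_kwh=0.165, tokens_per_second=10)

print(f"optimistic:  ${low:.2f} per million tokens")
print(f"pessimistic: ${high:.2f} per million tokens")
print(f"spread: {high / low:.1f}x")
```

With these placeholder inputs the spread is 2 x 1.1 x 4 = 8.8x, which is why the choice of which end of each range to use dominates the result.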
But once all that is done you still own a Mac in one case, and you don’t in the other, correct?
Even at just the electricity cost openrouter will be both
1) Roughly break-even to a little bit cheaper per token cost 2) Much, much, faster
So the cost of the mac barely even matters, it's just an extra cost beyond.
Sure, data center providers can pay lower rates.
The point of this article is that LLMs at home really don't make a ton of sense, unless you are willing to pay through the nose for privacy. There is absolutely no cost saving to be had.
If you're looking at your own datacenter as a larger corporate client, that could change.
There are also some providers that will contractually keep your data private, like AWS Bedrock or parts of Google/Azure (I don't know their stack names).
AWS even has AWS Secret Region and AWS Top Secret Region if you want to use LLMs on classified data.
You have to value privacy at a roughly absurd level to not want to use LLMs run efficiently at scale by someone else. For the home user, just the extra efficiency produced by batching requests from a large number of users in a datacenter is a real win.
Some of these companies are even selling tokens below cost to get marketshare. If someone will sell you a service for a dollar bill or three quarters, why wouldn't you take the three quarters?
Plus your privacy.
Not always. The calculations take its useful life expectancy as an input. If they estimate it correctly, there is a high likelihood of it breaking, burning out, or being woefully out of date by the end. At the 10-year window you are looking at losing support for security updates.
So if you are lucky you might end up with something that still runs but most folks won't find it particularly useful
Yea this; it’s the same reason why mortgaging is cheaper than renting
I’ll keep my data local over a $.02/mtok difference.
It’s more than just data locality. OpenRouter is faster, no? I have an M4 pro, and anything but the smallest dumbest models are unusably slow for interactive use. I personally haven’t yet found a good use case for offline/non-interactive LLM work locally.
What is it with AI SaaS naming themselves "openxyz" when there is 0% open about them?
They learnt from OpenAI that naming yourself open-xyz doesn't actually require opening anything.
It's the next co-opted buzzword after "democratize".
It's how marketing works. If something is a problem they have to loudly claim to have fixed it. Look around the economy and you'll see lots of it. "Healthy" (high sugar) muesli bars, clean-diesel, surveillance wrapped up as keeping us safe. The modus operandi of marketing is to change minds about self evident things otherwise what is the point?
Also many have power even cheaper or even free unused surplus power with solar.
I don't do local inference other than hobby & learning reasons because electricity is so expensive where I am at.
The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?
They're responding to the people doing things like buying the most expensive Mac they can find specifically to do local inference for their AI agents.
Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.
But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.
It's worth paying a premium for the privacy (assuming that llama.cpp and ollama aren't sending my sessions back to the cloud regardless...), and for the concerns about not getting a surprise bill.
You also have control over your costs. It is reasonable to assume that tokens will cost significantly more in the near to medium future as the market consolidates and subsidies decline.
No, that’s not the point. I think this is to help people who are thinking about getting a beefier Mac so they can run their LLMs on it too. Some in particular want a dedicated Mac Mini or Studio for this purpose. The breakdown, even if slightly flawed, offers a good insight into the economics of it.
For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.
> your data is never retained beyond the life of the request.
Like with OpenAI for a year?
“In June 2025, the court ordered OpenAI to retain its consumer and API customer chat logs indefinitely, including any that had been deleted, so they could be investigated […]”
https://www.techspot.com/news/109839-openai-no-longer-requir...
I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.
But the people who want to do local inference are putting some amount of value on privacy that’s not captured by the raw monetary value, so just comparing the price is somewhat beside the point. It’s also true that if you have, e.g., a Mac and you use it as your main computing device, then you would have spent money on it anyway, so you can’t really compare its cost to spending on something that’s not general purpose.
using it 24/7 brings the average cost down, not up.
the less you use local LLM, the less sense it makes since you paid a lot for hardware you don't use
That's the point: why would you buy a device that's specifically not optimized to be used for 24/7 inference? It's expensive hardware that's not designed to be used in that situation! The power use for inference isn't especially good and you're not getting even a fraction of the benefit from the hardware that you're paying for.
> why would you buy a device that's specifically not optimized to be used for 24/7 inference
because it costs $1k-$2k instead of $10k-30k+ for optimized devices
Good question but people are doing it anyway. It's a fact that right now tons of people are buying Mac Minis specifically for this use case, to treat them as their personal data center for agents. The concept of "power use for inference" is foreign. Those people are the ones that motivated this blog post I think.
The hardware has multiple uses for the same cost. The pay-per-use server does not.
The author isn't pricing in the multiple uses. You either compare it apples to apples or you don't. If you're using the machine for general purpose computing on top of inference then the amortized hardware costs are pointless to measure. This is exactly what I said.
Not sure where 40 tokens per second is coming from. I’ve seen 95-100 tokens per second on M5 Max 128GB running Gemma 4 31B. I’ve done experiments where it is faster than Claude Opus 4.5 for the same prompts.
Wild. That must be like a 5,000 USD laptop.
can you provide your configurations pls ?
It's actually a bit faster than that now it seems, about 112 tok/sec.
Configuration:
Gemma 4 31B Instruct Q6K
Context size 40960
LM Studio 0.4.13+1
Metal llama.cpp v2.14.0
LM Studio MLX (Apple M5) v1.6.0
Here are my results:
prompt eval time = 32545.36 ms / 5625 tokens ( 5.79 ms per token, 172.84 tokens per second)
eval time = 20227.99 ms / 310 tokens ( 65.25 ms per token, 15.33 tokens per second)
total time = 52773.35 ms / 5935 tokens
This was for interacting with a local MCP service, running a tool that returns a ~20KB text file to the agent to add to the chat context.
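For anyone comparing their own runs, the per-phase rates in the timing lines above can be recomputed from the raw milliseconds and token counts. This is only a sketch of the arithmetic; the variable and function names are mine, and the numbers are copied from the results posted above.

```python
# Raw figures copied from the timing output above.
prompt_ms, prompt_tokens = 32545.36, 5625   # prompt processing (prefill)
eval_ms, eval_tokens = 20227.99, 310        # token generation (decode)


def tokens_per_second(ms: float, tokens: int) -> float:
    """Throughput for one phase, given elapsed milliseconds and token count."""
    return tokens / (ms / 1000.0)


prompt_tps = tokens_per_second(prompt_ms, prompt_tokens)
gen_tps = tokens_per_second(eval_ms, eval_tokens)
total_ms = prompt_ms + eval_ms

# Share of wall-clock time spent reading the prompt rather than generating.
prompt_share = prompt_ms / total_ms
# Generated tokens divided by the whole request time, prefill included.
end_to_end_tps = tokens_per_second(total_ms, eval_tokens)

print(f"prompt:     {prompt_tps:.2f} tok/s")
print(f"generation: {gen_tps:.2f} tok/s")
print(f"prompt share of total time: {prompt_share:.1%}")
print(f"end-to-end output rate:     {end_to_end_tps:.2f} tok/s")
```

The point of the last two numbers is that with a large tool result in context, most of the wait is prompt processing, so the headline generation rate overstates how fast the request feels.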
I'm seeing about the same number of tokens/second on an M2 Ultra that I have access to (also with 128GB of memory).
This is surely apples-to-oranges to the OP results (and I don't spend a great deal of time benchmarking these things, so my methodology might be lacking), but it's interesting seeing okay performance for a top open model. For most use, however, I find Gemma 4 26B A4B (Q6K) to be good enough (esp. for MCP calling) and much much faster (~1,200 tokens/second).
Actually, figuring it on generating tokens 24/7 is the best case scenario. If you figure it at 8 hours a day of actual use, you still have the fixed cost of the hardware being the highest portion of the budget, but now you generate 1/3 the tokens so you triple that cost per token.
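A rough sketch of that amortization, assuming the hardware cost is spread evenly over every token generated during its useful life. The purchase price, lifetime, and token rate are hypothetical placeholders, not numbers from the article.

```python
def hardware_cost_per_mtok(hardware_usd: float, lifetime_years: float,
                           tokens_per_second: float, hours_per_day: float) -> float:
    """Hardware cost in USD per million tokens, amortized over active hours only."""
    active_seconds = lifetime_years * 365 * hours_per_day * 3600
    total_tokens = tokens_per_second * active_seconds
    return hardware_usd / total_tokens * 1_000_000


# Hypothetical machine: $2000, 3-year useful life, 20 tokens per second.
full_time = hardware_cost_per_mtok(2000, 3, 20, hours_per_day=24)
work_day = hardware_cost_per_mtok(2000, 3, 20, hours_per_day=8)

print(f"24 h/day: ${full_time:.2f} per million tokens")
print(f" 8 h/day: ${work_day:.2f} per million tokens")
print(f"ratio: {work_day / full_time:.1f}x")
```

Whatever inputs you pick, the ratio between the two cases stays at 3x, because the fixed cost is divided by one third as many tokens. This only covers the hardware share; electricity scales with use and is unaffected.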
Rounded up, yes, and oddly inefficient for someone obsessed with inefficiency. One could buy a brand new 64gb M5 macbook for well over 4k. Another could buy a scratched up but functioning M1 Max 64gb off of ebay for a little over 1k—and somehow get the same 10-20 t/s with 31b that the author does with an M5. Or better yet, have a frontier model do the planning and judging, and have a local MOE model execute at 50 t/s. All of this achievable by a former English major with too much free time.
I have an M1 Pro, and a M4 & M5 max to play with at work and the speed difference is very significant between all 3 machines, the M1 Pro is far slower, and the M5 is significantly faster than the M4. And a windows 3090 beats all of them but eats twice the amount of power per token. This is all running the same 24GB memory friendly model with LM studio.
The real reason this comparison makes no sense is that only a vanishingly small fraction of people seriously using ai to code would seriously use a model so far from the top models (including open source ones).
He should compare his MacBook to Open Router on Kimi 2.6 1.1T or GLM 5.1 (754B), at bfloat16 precision, which he can't ofc.
But it furthers his point that things like open router are a better idea, which is not surprising.
> Yeah, data centers don't pay residential electricity rates.
There are 2 caveats here:
Some places have higher prices for industrial than residential power, as the residential rate might be subsidized by the government.
Data centers also pay for cooling, which residential users only effectively pay for if they have AC and it is hot outside. So a data center's effective power cost is some multiple of the industrial rate.
Generally you don't build a data center in a place that doesn't sell you electricity for cheap
Still, for the purpose of the calculation you need to account for the price of both the power and removing the heat created by that power.
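One common way to fold cooling into the number is a PUE-style multiplier (total facility energy divided by the energy that reaches the computers). The sketch below assumes that model; the rates and multipliers are hypothetical placeholders, not measured values for any real facility or household.

```python
def effective_usd_per_kwh(rate_usd_per_kwh: float, overhead_multiplier: float) -> float:
    """Price per kWh of compute load once cooling and other overhead are included.

    overhead_multiplier is total energy drawn divided by energy used by the
    computers themselves (1.0 means no cooling overhead at all).
    """
    if overhead_multiplier < 1.0:
        raise ValueError("overhead multiplier cannot be below 1.0")
    return rate_usd_per_kwh * overhead_multiplier


# Hypothetical figures for illustration only.
datacenter = effective_usd_per_kwh(0.08, 1.3)    # cheap industrial rate, always cooled
home_winter = effective_usd_per_kwh(0.16, 1.0)   # residential rate, no AC running
home_summer = effective_usd_per_kwh(0.16, 1.4)   # residential rate, AC removing the heat

print(f"data center: ${datacenter:.3f}/kWh")
print(f"home winter: ${home_winter:.3f}/kWh")
print(f"home summer: ${home_summer:.3f}/kWh")
```

The multiplier narrows the gap between industrial and residential rates but, with placeholder numbers like these, does not close it.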
Boss, I make 16.50 per hour, say 15, I work 36 hours, say 35, say 500 per week, say 4 weeks per month, that's only about 2000! Don't you agree I need a raise?
We also have no idea what it actually costs Anthropic. This could be wildly subsidized and actually Apple Silicon is more cost effective.
Your post makes sense if you bought the hardware for other reasons, and maybe run models occasionally as a novelty.
That isn't the case for many, though, and there is a whole social media space where people are hyping up the latest homebrew options for running models, believing it frees them from the yoke of big AI.
Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs. Someone rationally running the numbers is a good thing.
Let's not get ahead of ourselves. Millions, really? I can believe there are a lot of enthusiasts doing this, but "millions" needs a citation.
This is HN; it has probably never occurred to half the people here that the average person even in first world countries doesn't even have the financial capacity to make an impulse five-figure USD purchase, even if on credit.
> Millions of people are buying big $ maxed-out hardware like the Mac Studios or DGX specifically to run LLMs.
What's your source for this?
Not just one, but two replies completely hung up on that. Like, why even reply when you already saw the other guy doing the same tired pedantry? Just wanted to feel like you contributed?
Ignoring that it was just tosser hyperbole (that absolutely zero reasonable people need to question), yes, enormous numbers of people are buying GPUs or hardware with the explicit goal of running local LLMs, and social media is full of people hyping various setups and models. Mac Minis are almost impossible to find, and that alone is selling at a clip of about 300,000 every four months. Large memory GPUs are basically a myth at this point. All so people can pay more to get a worse result than commercial options, which is precisely the point of the submission.
These local setups only ever make sense if you have something that confidential, or you're doing something that ToS of the majors would ban you for.
Now given this pedantry horseshit, you'll probably demand that I specifically show a citation on DGX or Studio sales, which...rofl.
Honestly, I don't even see my Macbook Pro costing me anywhere near as much as using any of these AI services, but maybe I'm just not seeing a significant increase in my power bill to notice? I am the power user who uses Claude Max pretty much all the time to prototype ideas, and build things I actually use, and has given me a lot of value, I work full time and have a family to raise and care for, my free coding time is mostly limited to ideas. Now I can draft a plan with detail, review the code, run the code, test it, and use software custom tailored to my needs.
nothing about the current data center craze looks efficient.
Whether you think building data centers or not is a good idea it's inarguable that the per-token efficiency (power, hardware, etc) is FAR higher in a data center. That's literally what it's designed for.
I'm talking per value. Look at the efficiency of Chinese open source models; then look at SOTA sucking gigawatts, then the proposals.
America is basically proposing AI using the equivalent bloatware of Windows 11.
Probably because lots of data centres are being built (or half-built) which are sitting idle.
If there are datacenters sitting idle right now then you could probably make a lot of money selling that capacity to Anthropic at this point...
If you have racks of idle H100s, you are doing a terrible job of running a business.