Comment by mechagodzilla

2 months ago

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

15 comments

mechagodzilla

christina97 1 month ago

Let's say 1.5tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say 15c/kWh means you're paying in the vicinity of $13.8/mtok of output. Looking at R1 output costs on OpenRouter, it's costing about 5-7x as much as what you can pay for third party inference (which also produce tokens ~30x faster).

tyre 2 months ago

Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?

mechagodzilla 2 months ago
It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc, and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions which have relatively high output per input token. Using one for a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fit in the 24GB of VRAM and would have very different cost/performance tradeoffs.
- mark_l_watson 2 months ago
  
  For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.
  I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.
  Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware but I still believe in getting maximum utilization from hardware and wanting to be at least the ‘majority partner’ in AI agentic enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.
Workaccount2 2 months ago
Never, local models are for hobby and (extreme) privacy concerns.
A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.
- g947o 2 months ago
  
  This.
  I spent quite some time on r/LocalLLaMA and yet need to see a convincing "success story" of productively using local models to replace GPT/Claude etc.
  
  3 replies →
- FuckButtons 2 months ago
  
  I was able to run a batch job that lasted ~2 weeks of inference time on my m4 max by running it over night against a large dataset I wanted to mine. It cost me pennies in electricity and writing a simple python script as a scheduler.
dimava 2 months ago
Tokens will cost same on Mac and on API because electricity is not free
And you can only generate like $20 of tokens a month
Cloud tokens made on TPU will always be cheaper and waaay faster then anything you can make at home
- reissbaker 2 months ago
  
  This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
  Also, vendors need to make a profit! So tack a little extra on as well.
  However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
  
  2 replies →
oceanplexian 2 months ago

It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic Subscription.
None of them will keep your data truly private and offline.