
Comment by hasperdi

2 days ago

I bought a second-hand Mac Studio (M1 Ultra) with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, a 4-bit quantized GLM 4.6 runs very slowly on my Mac. It's not only the tokens-per-second generation speed; input processing, tokenization, and prompt loading take so long that it tests my patience. People often mention the TPS numbers, but they neglect to mention the input loading times.

At 4 bits that model won't fit in 128 GB, so you're spilling over into swap, which kills performance. I've gotten great results out of glm-4.5-air, which is 4.5 distilled down to ~110B params; it fits nicely at 8 bits, or maybe 6 if you want a little more RAM left over.
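
For rough numbers, here's a minimal back-of-envelope sketch (in Python), assuming GLM 4.6 is roughly a 355B-parameter model and GLM-4.5-Air around 110B; weights only, ignoring KV cache and runtime overhead:

    # Approximate weight size of a quantized model, in decimal GB.
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(weights_gb(355, 4))   # GLM 4.6 @ 4-bit:      ~178 GB, well over 128 GB
    print(weights_gb(110, 8))   # GLM-4.5-Air @ 8-bit:  ~110 GB, just fits
    print(weights_gb(110, 6))   # GLM-4.5-Air @ 6-bit:  ~83 GB, leaves headroom for context

The unified memory also has to hold the KV cache and macOS itself, so the usable ceiling is noticeably below the full 128 GB.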

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

  • Let's say 1.5 tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say, 15c/kWh, you're paying in the vicinity of $13.8/Mtok of output (the arithmetic is sketched below this thread). Looking at R1 output costs on OpenRouter, that's about 5-7x what you'd pay for third-party inference (which also produces tokens ~30x faster).

  • Given the cost of the system, how long would it take to become less expensive than, for example, a $200/mo Claude Max subscription running Opus?

    • It's not really an apples-to-apples comparison. I enjoy playing around with LLMs, running different models, etc., and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is research questions, which have relatively high output per input token. Using one as a coding assistant seems like it would run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fits in 24 GB of VRAM and has very different cost/performance tradeoffs.

    • Never; local models are for hobby use and (extreme) privacy concerns.

      A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.

    • Tokens will cost the same on a Mac as on an API, because electricity is not free.

      And you can only generate something like $20 worth of tokens a month.

      Cloud tokens made on TPUs will always be cheaper and waaay faster than anything you can make at home.

    • It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic subscription.

      None of them will keep your data truly private and offline.
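
To ground the electricity and break-even math above, here's a minimal sketch (in Python) using only the figures quoted in the thread: 1.5 tok/s, 500 W, 15c/kWh, a $2k machine, and a $200/mo subscription. OpenRouter prices move around, so treat the comparison as ballpark.

    # Electricity cost per million output tokens for the local rig described above.
    tok_per_s = 1.5
    watts = 500.0
    usd_per_kwh = 0.15

    tok_per_wh = tok_per_s * 3600 / watts      # ~10.8 tokens per watt-hour
    kwh_per_mtok = 1e6 / tok_per_wh / 1000     # ~93 kWh per million output tokens
    usd_per_mtok = kwh_per_mtok * usd_per_kwh  # ~$13.9/Mtok, electricity alone

    # Hardware break-even against a $200/month subscription, ignoring power.
    months_to_break_even = 2000 / 200          # 10 months

    print(round(tok_per_wh, 1), round(usd_per_mtok, 1), months_to_break_even)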

Yes, they conveniently forget to disclose prompt processing time. There is an affordable answer to this; I'll be open-sourcing the design and software soon.

Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.

You'd need the M5 (Max/Ultra next year) with its matmul instruction set that massively speeds up prompt processing.

Anything except a 3-bit quant of GLM 4.6 will exceed the 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you at least need the entire model to fit in memory.
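
As a rough check (again assuming ~355B total parameters for GLM 4.6, weights only), in Python:

    # GLM 4.6 weight size vs. a 128 GiB unified-memory ceiling.
    params = 355e9                               # assumed total parameter count
    for bits in (3, 4):
        print(bits, params * bits / 8 / 2**30)   # 3-bit: ~124 GiB, 4-bit: ~165 GiB

A 3-bit quant squeezes in with little room left for the KV cache; 4-bit simply cannot fit, which is where the swapping comes from.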