Comment by andai
1 day ago
So I'm hearing about a lot of people running LLMs on Apple hardware. But is there actually anything useful you can run? Does it run at a usable speed? And is it worth the cost? Because the last time I checked, the answer to all three questions appeared to be no.
Though maybe it depends on what you're doing? (Although if you're doing something simple like embeddings, then you don't need the Apple hardware in the first place.)
I was sitting on an airplane next to a guy on a MacBook Pro (some recent model) who was coding in Cursor with a local LLM. We got talking, and he said there are obviously differences, but for his style of "English coding" (he basically described what code to write and which files to change, but in English, and more loosely than actual code, obviously, or he would just write the code himself) it works really well. And indeed that's what he could demo. The model (gpt-oss, I believe) did pretty well in his Next.js project, and fast too.
Thanks. I call this method Power Coding (like Power Armor), where you're still doing everything except for typing out the syntax.
I found that for this method the smaller the model, the better it works, because smaller models can generally handle it, and you benefit more from iteration speed than anything else.
I don't have hardware to run even tiny LLMs at anything approaching interactive speeds, so I use APIs. The one I ended up with was Grok 4 Fast, because it's weirdly fast.
ArtificialAnalysis has an "end-to-end response time" section, and Grok 4 Fast was the best there for a long time, though many other models are catching up now.
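If you want to try the same setup, xAI's API is OpenAI-compatible, so a minimal sketch looks something like this (the model name and environment variable here are my assumptions; check the xAI docs for the exact identifiers):

    # Sketch of calling Grok 4 Fast via xAI's OpenAI-compatible endpoint.
    # "grok-4-fast" and XAI_API_KEY are assumptions; check the xAI docs.
    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://api.x.ai/v1",
                    api_key=os.environ["XAI_API_KEY"])

    resp = client.chat.completions.create(
        model="grok-4-fast",
        messages=[{"role": "user",
                   "content": "Rename fetch_user to load_user in api.py"}],
    )
    print(resp.choices[0].message.content)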
The speed is fine, the models are not.
I found only one great application of local LLMs: spam filtering. I wrote a "despammer" tool that accesses my mail server using IMAP, reads new messages, and uses an LLM to determine if they are spam or not. 95.6% correct classification rate on my (very difficult) test corpus, in practical usage it's nearly perfect. gpt-oss-20b is currently the best model for this.
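The core of the approach fits in a few lines. A minimal sketch (not the actual tool), assuming Ollama serving gpt-oss:20b locally; the mail host, credentials, and folder names are hypothetical placeholders:

    # Poll an IMAP inbox and ask a local LLM to classify each new message.
    import email
    import imaplib
    import json
    import urllib.request

    def is_spam(subject: str, body: str) -> bool:
        """Ask the local model for a one-word SPAM/HAM verdict."""
        prompt = ("Classify the following email as SPAM or HAM. "
                  "Answer with a single word.\n\n"
                  f"Subject: {subject}\n\n{body[:2000]}")
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps({"model": "gpt-oss:20b",
                             "prompt": prompt,
                             "stream": False}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return "SPAM" in json.loads(resp.read())["response"].upper()

    with imaplib.IMAP4_SSL("mail.example.com") as imap:
        imap.login("user", "password")
        imap.select("INBOX")
        _, data = imap.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = imap.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            if msg.is_multipart():
                part = next((p for p in msg.walk()
                             if p.get_content_type() == "text/plain"), None)
                raw = (part.get_payload(decode=True) or b"") if part else b""
            else:
                raw = msg.get_payload(decode=True) or b""
            if is_spam(msg.get("Subject", ""), raw.decode(errors="replace")):
                imap.copy(num, "Junk")  # file it in the spam folder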
For all other purposes, models with <80B parameters are just too stupid to do anything useful for me. I write in Clojure and there is no boilerplate: the code reflects real business problems, so I need an LLM that is capable of understanding things. Claude Code, especially with Opus, does pretty well on simpler problems; all local models are just plain dumb and a waste of time compared to that, so I don't see the appeal yet.
That said, my next laptop will be a MacBook Pro with an M5 Max and 128GB of RAM, because the small LLMs are slowly getting better.
I've tried gpt-oss:20b on a MacBook Air with 24GB of RAM (via Ollama). In my experience its output is comparable to what you'd get out of older models, and the OpenAI benchmarks seem accurate: https://openai.com/index/introducing-gpt-oss/ . Definitely a usable speed. Not instant, but ~5 tokens per second of output if I had to guess.
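If you want a real number instead of a guess, Ollama's /api/generate response includes eval_count and eval_duration (in nanoseconds), so you can compute tokens per second directly. A quick sketch:

    # Compute tokens/sec from the stats Ollama returns with a
    # non-streaming /api/generate call.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "gpt-oss:20b",
                         "prompt": "Explain IMAP in one paragraph.",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.loads(resp.read())

    print(f"{stats['eval_count'] / stats['eval_duration'] * 1e9:.1f} tokens/sec")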
This paper shows a use case running on Apple silicon that’s theoretically valuable:
https://pmc.ncbi.nlm.nih.gov/articles/PMC12067846/
Who cares if the result is right or wrong, etc., since it will all be different in a year … it's just interesting to see a test of desktop-class hardware go OK.
I have an M3 Max MBP with 64GB of RAM, and I can run a lot at useful speeds (LLMs run fine; diffusion image models run OK, although not as fast as they would on a 3090). My laptop isn't typical, though; it isn't a standard MBP with a base or Pro processor.
Of course it depends on what you're doing.
Do you work offline often?
Essential.
I can definitely write code with a local model like Devstral Small, a quantized Granite, or a quantized DeepSeek on an M1 Max with 64GB of RAM.