What would be a good coding model to run on an M3 Pro (18GB) to get a Codex-like workflow and quality? Essentially, I run out of usage quickly when using Codex-High in VSCode on the $20 ChatGPT plan, and I'm looking for cheaper/free alternatives (even if a little slower, but with the same quality). Any pointers?
Nothing. This summer I set up a dual 16GB GPU / 64GB RAM system and nothing I could run was even remotely close. Big models that didn't fit in 32GB of VRAM had marginally better results, but were at least an order of magnitude slower than what you'd pay for and still much worse in quality.
I gave one of the GPUs to my kid to play games on.
Short answer: there is none. You can't get frontier-level performance from any open source model, much less one that would work on an M3 Pro.
If you had more like 200GB ram you might be able to run something like MiniMax M2.1 to get last-gen performance at something resembling usable speed - but it's still a far cry from codex on high.
at the moment, I think the best you can do is qwen3-coder:30b -- it works, and it's nice to get some fully-local llm coding up and running, but you'll quickly realize that you've long tasted the sweet forbidden nectar that is hosted llms. unfortunately.
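If you go that route, here's a minimal sketch of wiring a script up to it - assuming you serve the model through Ollama's OpenAI-compatible endpoint on the default port (adjust base_url and the model tag for your own setup):

    # Minimal sketch: talk to a locally served qwen3-coder:30b through
    # Ollama's OpenAI-compatible endpoint (default port 11434).
    # Assumes the model is already pulled, e.g. `ollama pull qwen3-coder:30b`.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # default Ollama endpoint
        api_key="ollama",  # Ollama ignores the key, but the client requires one
    )

    resp = client.chat.completions.create(
        model="qwen3-coder:30b",
        messages=[{"role": "user", "content": "Write a function that reverses a linked list."}],
    )
    print(resp.choices[0].message.content)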
They are spending hundreds of billions of dollars on data centers filled with GPUs that cost more than an average car and then months on training models to serve your current $20/mo plan. Do you legitimately think there's a cheaper or free alternative that is of the same quality?
I guess you could technically run the huge leading open weight models using large disks as RAM and have close to the "same quality" but with "heat death of the universe" speeds.
"run" as in run locally? There's not much you can do with that little RAM.
If remote models are ok you could have a look at MiniMax M2.1 (minimax.io) or GLM from z.ai or Qwen3 Coder. You should be able to use all of these with your local openai app.
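For example, here's a rough sketch of pointing the standard OpenAI Python client at one of those hosted providers - the base URL, env var, and model id below are placeholders, not real values; check the provider's docs (minimax.io, z.ai, etc.) for the actual endpoint and model names:

    # Sketch of using a hosted, OpenAI-compatible alternative.
    # base_url, PROVIDER_API_KEY, and the model id are placeholders.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example-provider.com/v1",  # placeholder endpoint
        api_key=os.environ["PROVIDER_API_KEY"],          # placeholder env var
    )

    resp = client.chat.completions.create(
        model="example/coding-model",  # e.g. a MiniMax, GLM, or Qwen3 Coder model id
        messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
    )
    print(resp.choices[0].message.content)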
Max was always closed.
So the only way to run it is by using Qwen's API? No thanks. At least with Kimi and GLM, I can use Fireworks/whatever to avoid sending data to China.
When I looked earlier, Qwen claims to have DCs in Singapore and (I think?) the US but now I can't seem to find where I saw that.
Whether that means anything, I dunno.
afaiu not all of their models are open weight releases, this one so far is not open weight (?)
18GB of RAM is a bit tight.
With 32GB of RAM:
qwen3-coder and GLM 4.7 Flash are both impressive 30B-parameter models.
Not on the level of GPT 5.2 Codex, but small enough to run locally (4-bit quantized, with 32GB of RAM; rough math below) and quite capable.
But I think it's just a matter of time until we get quite capable coding models that will be able to run with less RAM.
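For a rough sense of why 18GB is tight and 32GB is comfortable, here's the back-of-the-envelope math (assuming ~0.5 bytes per parameter for 4-bit quantization; real usage varies with the quant format, context length, and runtime overhead):

    # Rough memory estimate for a 4-bit quantized ~30B model.
    params = 30e9          # ~30B parameters
    bytes_per_param = 0.5  # 4-bit quantization ~ 0.5 bytes per parameter
    weights_gb = params * bytes_per_param / 1e9
    kv_cache_gb = 2.0      # rough allowance for KV cache at a modest context
    print(f"weights ~{weights_gb:.0f} GB + ~{kv_cache_gb:.0f} GB KV cache + OS overhead")
    # ~15 GB of weights alone: tight on an 18GB M3 Pro (unified memory shared
    # with the OS), workable on 32GB.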
Z.ai has GLM-4.7. It's almost as good, for about $8/mo.
A local model with 18GB of RAM that has the same quality as Codex High? Yeah, nah mate.
The best could be GLM 4.7 Flash, and I doubt it's close to what you want.
"run" as in run locally? There's not much you can do with that little RAM.
If remote models are ok you could have a look at MiniMax M2.1 (minimax.io) or GLM from z.ai or Qwen3 Coder. You should be able to use all of these with your local openai app.
antigravity is solid and has a generous free tier.