
Comment by embedding-shape

13 hours ago

> I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful

I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.

I'm wondering whether one could crowdsource chat logs from GPT-OSS-120b running with Codex, then seed another post-training run that fine-tunes the 20b variant on the good runs from the 120b, and whether that would make a big difference. With reasoning_effort set to high, both models are actually quite good compared to other downloadable models, but the 120b is just about out of reach for 64GB, so making the 20b better at specific use cases seems like it would be useful.
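
For what it's worth, a minimal sketch of how I set the effort when talking to a local llama.cpp llama-server through its OpenAI-compatible endpoint; the base URL, the model name, and whether the server actually honors `reasoning_effort` are assumptions about my setup, not a general recipe:

```python
# Sketch only: asking a local OpenAI-compatible server (llama.cpp's llama-server
# here) to run gpt-oss at high reasoning effort. The base_url and model name are
# placeholders; if your server ignores `reasoning_effort`, the effort usually has
# to be set via the chat template / system prompt instead.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="gpt-oss-120b",             # whatever name your server exposes
    reasoning_effort="high",          # may be ignored depending on server/build
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```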

Are you running the 120B agentically? I tried it in a few different setups and it failed hard in every one: it would just give up after a second or two every time.

I wonder if it has to do with the message format, since it should be able to do tool use afaict.

  • This is a common problem for people trying to run the GPT-oss models themselves. Reposting my comment here:

    GPT-oss-120B was also completely failing for me until someone on Reddit pointed out that you need to pass the reasoning tokens back in when generating a response. One way to do this is described here:

    https://openrouter.ai/docs/guides/best-practices/reasoning-t...

    Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

    Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around. A rough sketch of the pass-back loop follows below.
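
Concretely, the pass-back loop looks roughly like this. Treat it as a sketch under assumptions: the field that carries the reasoning is named differently by different backends (OpenRouter, llama.cpp, etc.), so `REASONING_FIELD`, the base URL, and the model name below are placeholders to adapt:

```python
# Rough sketch of passing reasoning tokens back between turns of an agent loop.
# Assumes an OpenAI-compatible server that returns the model's reasoning on the
# message object; the attribute name varies by backend, so adjust REASONING_FIELD.
from openai import OpenAI

REASONING_FIELD = "reasoning"  # assumption: actual field name depends on your backend

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")
messages = [{"role": "user", "content": "List the files changed in the last commit."}]

for _ in range(8):  # simple agent loop with a hard turn limit
    resp = client.chat.completions.create(model="gpt-oss-120b", messages=messages)
    msg = resp.choices[0].message

    # Re-attach the reasoning to the assistant turn we send back on the next
    # request, instead of silently dropping it like many frontends do.
    assistant_turn = {"role": "assistant", "content": msg.content or ""}
    reasoning = getattr(msg, REASONING_FIELD, None)
    if reasoning:
        assistant_turn[REASONING_FIELD] = reasoning
    if msg.tool_calls:
        assistant_turn["tool_calls"] = [tc.model_dump() for tc in msg.tool_calls]
    messages.append(assistant_turn)

    if not msg.tool_calls:
        break  # final answer, nothing left to execute

    # Run each requested tool and append its result (actual execution stubbed out).
    for tc in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": "stubbed tool output",
        })
```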

You are describing distillation. There are better ways to do it, and it has been done before: DeepSeek distilled onto Qwen.

I have a 128GB M3 Max MacBook Pro. Running the GPT-OSS model on it via LM Studio, once the context gets large enough the fans spin up to 100% and it's unbearable.