OpenAI Leaks 120B Open Model on Hugging Face

4 days ago (twitter.com)

I don't get the hype around OpenAI OSS. They would never open-source a model better than their proprietary ones, and the other open-source models already beat GPT and family, so why the wait?

  • While the benchmarks all say open-source models like Kimi and Qwen outpace proprietary models like GPT-4.1, GPT-4o, or even o3, my (and just about everyone I know's) boots-on-the-ground experience suggests they're not even close. This is for tool-calling agentic tasks, like coding, but also in other contexts (research, glue between services, etc.). I feel like it's worth putting that out there--it's pretty clear there's a lot of benchmark hacking happening. I'm not really convinced it's purposeful/deceitful, but it's definitely happening. Qwen3 Coder, for example, is basically incompetent at any real coding task and frequently gets caught in death spirals of bad tool calls. I try all the OSS models regularly, because I'm really excited for them to get better. Right now Kimi K2 is the most usable one, and I'd rate it a few ticks worse than GPT-4.1.

    • Isn't the problem with the benchmarks that most people running AI locally are running much smaller weights?

      I have an M4 Studio with a lot of unified memory and I'm still nowhere near running a 120B model. I'm at around 30B.

      Apple or Nvidia is going to have to sell 1.5 TB RAM machines before benchmark performance is going to be comparable.

      Plus, when you use Claude or OpenAI these days, it's performing Google searches etc. that my local model isn't doing.

    • It may be the way I use it, but qwen3-coder (30B with Ollama) is actually helping me with real-world tasks. It's a bit worse than the big models for the way I use it, but absolutely useful. I do use AI tools with very specific instructions, though: file paths, line numbers if I can, specific direction about what to do, my own tools, etc. So that may be why I don't see such a huge difference from the big models.

      I should try Kimi K2 too.

    • Not sure about benchmarks, but I did use DeepSeek for a variety of tasks when it was novel and cool, before going back to Claude, and in my experience it was OK, not significantly worse than the closed stuff at the time for what I use these models for (writing code a small function at a time, learning about libraries, etc.).

    • While that's true for some open-source models, I find DeepSeek R1 685B 0528 to be competitive with o3 in my production tests; I've been using it interchangeably for tasks I used to handle with Opus or o3.

  • Pure performance isn't necessarily everything. Context window, speed, and local use are just some of the upsides this model may have. We still know next to nothing, so anything is possible, but if it is an MoE at 120B, that could enable some interesting local use cases, even if it's less capable than, e.g., DeepSeek V3, simply by running on more hardware / at higher tokens/sec. GPT-4.1's code focus has also shown that OpenAI has a knack for models with a narrower use case; maybe this will do well on specific tasks. Especially since GPT-4.1 was that much better than the massive GPT-4.5, I am cautiously optimistic.

    Even if it does poorly in all areas (like Llama 4 [0]), there is still a lot the community and industry can learn from even an uncompetitive model.

    [0] Llama 4 technically has a massive 10M-token context as a differentiator; however, in my experience it is not reliably usable beyond 100k.

  • I don't see how it would be in OpenAI's selfish interest to release an open-source model that sucks. Unless you can cohesively explain how that would work in their favor, it seems a lot smarter to assume that they won't.

  • They might not release a better model than their proprietary models, but others can build on and tinker with these open models to improve and specialize them.

    Another reason people are 'hyped' for open models is that access to them cannot be taken away or price-gouged at the whim of the provider, and that their use cannot be restricted in arbitrary ways, although I'm sure that on the latter part they will have a go at it through regulation.

    Grab 'em while you can.

  • > they would never make a model better than their proprietary models open source

    Not better than their own proprietary models, but maybe better than other open-source models, or their competitors' closed-source models. That way they can first ensure they're the only player on both sides, and then kneecap their open-source models just enough to drive revenue to their proprietary ones.

    • Making a model better than the competitors' proprietary models is in fact making a model better than their own closed-source models, if you believe the benchmarks.

  • I think they could release non-agentic models that are as good as 4o and have almost no repercussions on sales, tbh.

    I have Ollama installed (only a small proportion of their clients would have a large enough GPU for this) and have downloaded DeepSeek and played with it, but I still pay for an OpenAI subscription because I want the speed of a hosted model, not to mention the luxuries of things like Codex's diffs / pull-request support, agents on new models, deep research, etc. - I use them all at least weekly.

    • I pay for Cursor, OpenAI, and Kimi (to use with Claude Code). OpenAI is good at quickly refining my thoughts. I'm considering cancelling my Cursor subscription - I bought it for Claude, but the rate limits are making it impossible for me to find it useful. Kimi is what truly surprises me: Claude Code shows "this conversation cost you $500" (based on Opus pricing, which is what Kimi K2 is mapped to) while I've barely spent $2. I have Ollama as well, mainly to quickly test small models that could be improved for our use case through fine-tuning.

    • They would definitely have sales repercussions, but it might be worth it.

      They are fully trying to be a consumer product, developer services be damned. But they can't just get rid of the API, because it's a good incremental source of revenue and, thanks to the Microsoft deal, all that revenue would otherwise end up in Azure. Maintaining their API is basically just a way to get a slice of that revenue.

      But if they open-sourced everything, it might further sour the relationship with Microsoft, who would lose Azure revenue and might be willing to part ways. It would also ensure that they compete on consumer product quality, not (directly) model quality. At this point, they could basically put any decent model in their app and maintain the user base; they don't actually need to develop their own.

Finally OpenAI is about to open something, and nobody on HN is happy. It would be interesting to see it thinking.

Would be interesting if this were a coding-focused model optimized for Mac inference. It would be a great way to undercut Anthropic.

Pretty much give away a Sonnet-level coding model and have it work with GPT-5 for harder tasks / planning.

Who's the target of a 120B open-weights model? You can only run this in the cloud; is it just PR?

I wish they'd release a nano model for local hackers instead.

  • You can run models the size of this one locally, even on a laptop; it's just not a great experience compared with an optimised cloud service. But it is local.

    The size in bytes of this 120B model is about 65 GB according to the screenshot, and elsewhere it's said to be trained in FP4, which matches.

    That makes this model small enough to run locally on some laptops without reading from SSD.

    The Apple M2 Max 96GB from January 2023, which is two generations old now, has enough GPU-capable RAM to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably more slowly. Even a PC with less than 64 GB of RAM can run it but it will be much slower due to having to read from the SSD constantly.

    If it's an MoE with ~20B active parameters, it will read only about a sixth of the weights per token, making it roughly 6x faster than a dense 120B FP4 model would be, but it still needs all the weights readily available across tokens (rough numbers in the sketch below).

    Alternatively, someone can distill and/or quantize the model themselves to make a smaller model. These things can be done locally, even on a CPU if necessary if you don't mind how long it takes to produce the smaller model. Or on a cloud machine rented long enough to make the smaller model, which you can then run locally.
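
    A rough back-of-envelope check of those numbers, assuming 120B total parameters, ~20B active per token, and FP4 weights (none of which is confirmed):

      # Back-of-envelope only; the 120B / 20B / FP4 figures are assumptions, not confirmed specs.
      total_params = 120e9        # rumoured total parameter count
      active_params = 20e9        # assumed active (per-token) parameters if it's an MoE
      bytes_per_param = 0.5       # FP4 = 4 bits = 0.5 bytes per weight

      weights_gb = total_params * bytes_per_param / 1e9
      read_per_token_gb = active_params * bytes_per_param / 1e9
      print(f"weights in RAM/on disk:  ~{weights_gb:.0f} GB")        # ~60 GB, close to the screenshot
      print(f"weights read per token:  ~{read_per_token_gb:.0f} GB") # ~10 GB
      print(f"MoE speedup vs dense:    ~{total_params / active_params:.0f}x")  # ~6x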

  • You can run it locally too. Below are a few of my local models; this one is coming in light compared to them. At Q4 it's ~60 GB. Furthermore, being an MoE, most of it can sit in system memory and only the shared experts need to go to the GPU; provided you have a decent system with decent memory bandwidth, you can get decent performance (a crude speed estimate follows the listing below). I'm running on GPUs; folks on Apple silicon can run this with minimal effort if they have enough RAM.

      126G /llmzoo/models/Qwen3-235B-InstructQ4
      126G /llmzoo/models/Qwen3-235B-ThinkingQ4
      189G /llmzoo/models/Qwen3-235B-InstructQ6
      219G /llmzoo/models/glm-4.5-air
      240G /llmzoo/models/Ernie
      257G /llmzoo/models/Qwen3-Coder-480B
      276G /llmzoo/models/DeepSeek-R1-0528-UD-Q3_K_XL.b.gguf
      276G /llmzoo/models/DeepSeek-TNG
      276G /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf
      422G /llmzoo/models/KimiK2
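
    To get a feel for why the GPU/CPU split still gives usable speeds, here is a crude decode-speed estimate. Decode is mostly memory-bandwidth bound; the active-parameter and bandwidth numbers below are illustrative guesses, not measurements of this model:

      # Crude estimate: tok/s ~= memory bandwidth / bytes of active weights touched per token.
      def tokens_per_sec(active_params_billions, bytes_per_param, bandwidth_gb_s):
          bytes_per_token = active_params_billions * 1e9 * bytes_per_param
          return bandwidth_gb_s * 1e9 / bytes_per_token

      # Assuming ~20B active params at FP4 (0.5 bytes/param):
      print(tokens_per_sec(20, 0.5, 100))    # ~10 tok/s if the experts stream from ~100 GB/s DDR5
      print(tokens_per_sec(20, 0.5, 400))    # ~40 tok/s from ~400 GB/s unified memory (M-series class)
      print(tokens_per_sec(20, 0.5, 1000))   # ~100 tok/s if everything fits in ~1 TB/s GPU memory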

  • They are probably hoping that someone else will distill it into smaller models, much like DeepSeek released a giant 671B model and there are now useful distillations down to 30B.
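
    For anyone wondering what "distill" means in practice, here is a minimal sketch of one common recipe: train a smaller student to match the big teacher's softened output distribution. The shapes, temperature, and random tensors are purely illustrative, not anything from a real released recipe:

      # Toy knowledge-distillation step: the student learns to imitate the teacher's
      # next-token distribution via a temperature-softened KL divergence.
      import torch
      import torch.nn.functional as F

      vocab, T = 32000, 2.0                               # toy vocab size and softening temperature
      teacher_logits = torch.randn(2, 64, vocab)          # stand-in for the frozen big model's outputs
      student_logits = torch.randn(2, 64, vocab, requires_grad=True)  # stand-in for the small model

      loss = F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      loss.backward()                                     # gradients flow only into the student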

  • A model this size is trivial to run on a modern workstation.

    • You'll have to define "modern workstation" for me, because I was under the impression that unless you've purpose-built your machine to run LLMs, a model this size is impossible.

  • They have a 20B for the GPU-poor, too.

    I will be running the 120B on my 2x4090-48GB, though.

Okay, so where do I download this now that it's been removed from Hugging Face?