Comment by jychang

9 days ago

Uh, that paragraph translated from "marketing bullshit" into "engineer" would be "we distilled the big gpt-5.3-codex model into a smaller size that fits in the 44 GB of SRAM of a Cerebras WSE-3, multiplied by whatever tensor-parallel or layer-parallel grouping they're doing".

(Cerebras runs llama-3.3 70b on 4 WSE-3 units with layer parallelism, for example).

That's basically exactly what gpt-5.3-codex-mini would be.

> Also "mini" is confusing because it's not analagous to gpt-5.1-codex vs gpt-5.1-codex-mini.

So perhaps OpenAI intentionally picked the model's layer and param counts, MoE expert size, etc. to fit onto the Cerebras machines. That's like saying "the DVD producer optimized this movie for you" (they just cropped and compressed it down to 4.7 GB so it would fit on a DVD). Maybe the typical mini model is 100 GB and they made it 99 GB instead, or something like that. It's still analogous to gpt-5.3-codex-mini.

I'm underselling it a little bit, because it takes a bit more work than that to get models to run on Cerebras hardware (because they're so weird and un-GPU-like), but honestly if Cerebras can get Llama 3.1 405b or GLM 4.7 running on their own chips, it's not that much harder to have Cerebras get gpt-5.3-codex-mini running.

Uh, the combined offering (smaller model + ~800 tps on Cerebras) is nothing like the previous mini offerings, and you're hallucinating details about their process of creating it.

Read more about how Cerebras hardware handles clustering. The limit is not 44 GB or 500 GB. Each CS-3 supports up to 1,200 TB of MemoryX, enough for ~24T-parameter models, and up to 2,048 systems can be clustered.

  • Yeah, it's pretty clear you're loud-mouthed and don't know anything about distilling ML models or about Cerebras. Distilling ML models into smaller mini versions is basic stuff. How do you think Qwen 3 235B and Qwen 3 30B were made? Or GLM 4.5 355B vs GLM 4.5 Air 105B? Or Meta Llama 4 Maverick and Scout? And everyone knows the reason Cerebras never served Deepseek R1 or Kimi K2 or any other model bigger than ~500B is that their chips don't have enough memory. People have been begging Cerebras to serve Deepseek forever now, and they never actually managed to do it.

    Cerebras doesn't run inference from MemoryX, the same way no other serious inference provider runs inference off of system RAM. MemoryX is connected to the CS-3 over Ethernet! It's too slow. MemoryX is only 150 GB/sec for the CS-3! [1] If you're running inference at 800 tokens/sec, 150 GB/sec means each token can only load ~0.19 GB of params. For obvious reasons, I don't think OpenAI is using a 0.19B-param model.
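
    Rough version of that arithmetic (the 1-byte-per-param assumption is mine, just to keep the numbers simple):

      # Weight bytes you can stream from MemoryX per decoded token.
      # Assumes 8-bit weights (1 byte/param) -- my assumption, not a Cerebras spec.
      memoryx_bw_gb_s = 150       # ~1.2 Tb/s system I/O ~= 150 GB/s
      target_tok_s = 800
      gb_per_token = memoryx_bw_gb_s / target_tok_s    # 0.1875 GB of weights per token
      max_params_billions = gb_per_token / 1.0         # ~0.19B params at 1 byte/param
      print(gb_per_token, max_params_billions)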

    The limit is 44 GB for each WSE-3. [2] That's how much SRAM a single WSE-3 unit has. For comparison, an Nvidia H100 GPU has 80 GB, and a DGX H100 server with 8 GPUs has 640 GB of VRAM. Each WSE-3 has 44 GB to play around with, and then if you have each one handling a few layers, you can load larger models. That's explicitly what Cerebras says they do: "20B models fit on a single CS-3 while 70B models fit on as few as four systems." [3]
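
    The same math, spelled out (fp16 weights are my assumption; lower precision would shrink the counts):

      # How many WSE-3s a model needs if the weights have to live entirely in on-wafer SRAM.
      # fp16 (2 bytes/param) is an assumption; lower precision halves these numbers.
      import math

      SRAM_PER_WSE3_GB = 44

      def wse3_needed(params_billions, bytes_per_param=2):
          weight_gb = params_billions * bytes_per_param
          return weight_gb, math.ceil(weight_gb / SRAM_PER_WSE3_GB)

      for name, p in [("20B", 20), ("70B", 70), ("405B", 405)]:
          gb, n = wse3_needed(p)
          print(f"{name}: ~{gb} GB of weights -> {n} system(s)")
      # 20B -> 40 GB -> 1 system, 70B -> 140 GB -> 4 systems, matching the quote above.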

    You're reading marketing drivel about training, which NOBODY uses Cerebras for. Basically nobody uses Cerebras for training, only inference.

    [1] https://www.kisacoresearch.com/sites/default/files/documents... "The WSE-2’s 1.2Tb/s of I/O bandwidth is used for [...] transmitting gradients back to the MemoryX service." That quote is about WSE-2/CS-2, but the CS-3 spec lists the same System I/O: 1.2 Tb/s (12×100 GbE).

    [2] https://cdn.sanity.io/images/e4qjo92p/production/50dcd45de5a... This really makes it obvious why Cerebras couldn't serve Deepseek R1. Deepseek is nearly 10x larger than a 70B model. Since they don't do tensor parallelism, each chip has to wait for the previous one to finish before it can start. So not only is it ~10x the memory consumption, it has to work through all of that sequentially to boot (rough toy model of that effect below these links). Cerebras' entire market demands 1000 tokens per second at the much higher price they charge, so there's no profit in serving a model they can only run at 500 tokens/sec or something slow like that.

    [3] https://www.cerebras.ai/blog/introducing-cerebras-inference-...
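
    Toy model of that sequential-stage effect (the per-stage time and hop latency are placeholders I made up, not Cerebras measurements):

      # Single-stream decode through a layer-parallel pipeline: each token visits
      # every stage in order, so per-token latency grows with the number of wafers.
      # per_stage_ms and hop_us are invented placeholders, not measured numbers.
      def tokens_per_sec(num_stages, per_stage_ms=0.3, hop_us=10):
          per_token_ms = num_stages * per_stage_ms + (num_stages - 1) * hop_us / 1000
          return 1000 / per_token_ms

      print(tokens_per_sec(4))    # ~813 tok/s for a 70B-class model split over 4 wafers
      print(tokens_per_sec(8))    # ~405 tok/s with twice as many sequential stages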

    • Yes. In order to serve ~1k tokens/sec, they must be fitting the entire model in SRAM rather than reaching out to off-chip RAM (rough numbers below). This means they're likely chaining multiple wafer-scale chips together to serve this model, or they shrunk the model to fit on one wafer. It's uneconomical for many use cases, but for highly valuable tasks it could be worth it.

      This is one thing Nvidia chips haven't been able to deliver: ultra-fast inference for ultra-high-value tasks. Hence the Groq acquisition.
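
      Back-of-the-envelope on why it has to be SRAM-resident (the active-param count and byte width are assumptions I picked, not anything OpenAI or Cerebras has published):

        # Weight bandwidth needed per second at a given decode speed.
        # 30B active params at 1 byte/param is an assumed example, not a published figure.
        active_params_billions = 30
        bytes_per_param = 1
        tok_s = 1000

        needed_gb_s = active_params_billions * bytes_per_param * tok_s   # 30,000 GB/s = 30 TB/s
        print(needed_gb_s)
        # ~30 TB/s of weight traffic vs ~150 GB/s to MemoryX over Ethernet;
        # only on-wafer SRAM (on the order of petabytes/sec) covers that.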

    • Yeah, it's pretty clear you're loudmouthed and an aggressively arrogant know-it-all (at least you think so). You keep moving the goalposts too. First you act like they can't run models that don't fit in 44 GB or 4x44 GB. Then you say they can "only" run a larger model at 500 tps, but that wouldn't be profitable... Lol

      Cerebras CURRENTLY serves GLM-4.7. I've used it through their API. Look up how big it is. 1,000-1,700 tps. https://www.cerebras.ai/blog/glm-4-7

      Not interested in further conversation, so have a nice day! You can go ahead and get in the last word though.