← Back to context

Comment by HumanOstrich

11 days ago

Not if you're suggesting that "(served by Cerebras)" should be part of the name. They're partnering with Cerebras and providing a layer of value. Also, OpenAI is "serving" you the model.

We don't know how they integrate with Cerebras hardware, but typically you'd pay a few million dollars to get the hardware in your own datacenter. So no, "served by Cerebras" is confusing and misleading.

Also "mini" is confusing because it's not analagous to gpt-5.1-codex vs gpt-5.1-codex-mini. Gpt-5.3-codex-spark is a unique, _experimental_ offering that doesn't fit the existing naming suffixes.

I don't understand what's wrong with "spark". It's friendly and evokes a sense of something novel, which is perfect.

If you want to know more about the model, read the first paragraph of the article. That information doesn't need to be hardcoded into the model name indefinitely. I don't see any "gpt-5.3-codex-nvidia" models.

Uh, that paragraph translated from "marketing bullshit" into "engineer" would be "we distilled the big gpt-5.3-codex model into a smaller size that fits on the 44GB of SRAM of a Cerebras WSE-3 multiplied by whatever tensor parallel or layer parallel grouping they're doing".

(Cerebras runs llama-3.3 70b on 4 WSE-3 units with layer parallelism, for example).

That's basically exactly what gpt-5.3-codex-mini would be.

> Also "mini" is confusing because it's not analagous to gpt-5.1-codex vs gpt-5.1-codex-mini.

So perhaps OpenAI intentionally picked the model's layer param count, MoE expert size, etc to fit onto the Cerebras machines. That's like saying "the DVD producer optimized this movie for you" (they just cropped and compressed it down to 4.7GB so it would fit on a DVD). Maybe the typical mini model is 100gb, and they made it 99gb instead or something like that. It's still analogous to gpt-5.3-codex-mini.

I'm underselling it a little bit, because it takes a bit more work than that to get models to run on Cerebras hardware (because they're so weird and un-GPU-like), but honestly if Cerebras can get Llama 3.1 405b or GLM 4.7 running on their own chips, it's not that much harder to have Cerebras get gpt-5.3-codex-mini running.

  • Uh, the combined offering (smaller model + ~800 tps on cerebras) is nothing like the previous mini offerings, and you're hallucinating details about their process of creating it.

    Read more about how Cerebras hardware handles clustering. The limit is not 44 GB or 500GB. Each CS-3 has 1,200 TB of MemoryX, supporting up to ~24T parameter models. And up to 2,048 can be clustered.

    • Yeah, it's pretty clear you're loud mouthed and don't know anything about distilling ML models or anything Cerebras. Distilling ML models into smaller mini versions is basic stuff. How do you think Qwen 3 235b and Qwen 3 30b were made? Or GLM 4.5 355b vs GLM 4.5 Air 105b? Or Meta Llama 4 Maverick and Scout? And everyone knows that the reason Cerebras never served Deepseek R1 or Kimi K2 or any other model bigger than ~500B is because their chips don't have enough memory. People have been begging Cerebras to serve Deepseek forever now, and they never actually managed to do it.

      Cerebras doesn't run inference from MemoryX, the same way no other serious inference provider runs inference off of system RAM. MemoryX is connected to the CS-3 over ethernet! It's too slow. MemoryX is only 150GB/sec for the CS-3![1] If you're running inference at 800tokens/sec, with 150GB/sec that means each token can only load 0.18GB of params. For obvious reasons, I don't think OpenAI is using a 0.18B sized model.

      The limit is 44GB for each WSE-3. [2] That's how much SRAM a single WSE-3 unit has. For comparison, a Nvidia H100 GPU has 80GB, and a DGX H100 server with 8 GPUs have 640GB of VRAM. Each WSE-3 has 44GB to play around with, and then if you have each one handling a few layers, you can load larger models. That's explicitly what Cerebras says they do: "20B models fit on a single CS-3 while 70B models fit on as few as four systems." [3]

      You're reading marketing material drivel about training models that NOBODY uses Cerebras for. Basically nobody uses Cerebras for training, only inference.

      [1] https://www.kisacoresearch.com/sites/default/files/documents... "The WSE-2’s 1.2Tb/s of I/O bandwidth is used for [...] transmitting gradients back to the MemoryX service." That quote is about WSE-2/CS-2, but the CS-3 spec lists the same System I/O: 1.2 Tb/s (12×100 GbE).

      [2] https://cdn.sanity.io/images/e4qjo92p/production/50dcd45de5a... This really makes it obvious why Cerebras couldn't serve Deepseek R1. Deepseek is 10x larger than a 70b model. Since they don't do tensor parallelism, that means each chip has to wait for the previous one to finish before it can start. So not only is it 10x more memory consumption, it has to load all that sequentially to boot. Cerebras' entire market demands 1000 tokens per second for the much higher price that they charge, so there's no profit in them serving a model which they can only do 500 tokens/sec or something slow like that.

      [3] https://www.cerebras.ai/blog/introducing-cerebras-inference-...

      2 replies →