Comment by jychang
9 days ago
> If instead the model is performing worse due to how much they had to shrink it just so it will fit on Cerebras hardware
They really should have just named it "gpt-5.3-codex-mini" (served by Cerebras). It would have made it clear what this model really is.
Not if you're suggesting that "(served by Cerebras)" should be part of the name. They're partnering with Cerebras and providing a layer of value. Also, OpenAI is "serving" you the model.
We don't know how they integrate with Cerebras hardware, but typically you'd pay a few million dollars to get the hardware in your own datacenter. So no, "served by Cerebras" is confusing and misleading.
Also "mini" is confusing because it's not analagous to gpt-5.1-codex vs gpt-5.1-codex-mini. Gpt-5.3-codex-spark is a unique, _experimental_ offering that doesn't fit the existing naming suffixes.
I don't understand what's wrong with "spark". It's friendly and evokes a sense of something novel, which is perfect.
If you want to know more about the model, read the first paragraph of the article. That information doesn't need to be hardcoded into the model name indefinitely. I don't see any "gpt-5.3-codex-nvidia" models.
Uh, that paragraph translated from "marketing bullshit" into "engineer" would be "we distilled the big gpt-5.3-codex model down to a size that fits in the 44 GB of SRAM on a Cerebras WSE-3, multiplied by whatever tensor-parallel or layer-parallel grouping they're doing".
(Cerebras runs llama-3.3 70b on 4 WSE-3 units with layer parallelism, for example).
That's basically exactly what gpt-5.3-codex-mini would be.
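For what it's worth, the SRAM math checks out as back-of-envelope arithmetic. A minimal sketch, where the 44 GB SRAM figure is Cerebras' published WSE-3 spec but the model sizes and precisions are my own illustrative assumptions:

```python
import math

# Rough sketch: how many WSE-3 wafers does a model need if its weights
# must live entirely in on-chip SRAM? Ignores KV cache and activation
# memory, so real deployments need more headroom than this.
SRAM_PER_WSE3_GB = 44  # on-chip SRAM per WSE-3 wafer

def wafers_needed(params_billions: float, bytes_per_param: float) -> int:
    weights_gb = params_billions * bytes_per_param  # 1B params @ 1 B/param = 1 GB
    return math.ceil(weights_gb / SRAM_PER_WSE3_GB)

print(wafers_needed(70, 2))  # Llama 3.3 70B at FP16 -> 140 GB -> 4 wafers
print(wafers_needed(40, 1))  # hypothetical ~40B distill at FP8 -> 1 wafer
```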
> Also "mini" is confusing because it's not analagous to gpt-5.1-codex vs gpt-5.1-codex-mini.
So perhaps OpenAI intentionally picked the model's layer count, parameter count, MoE expert size, etc. to fit onto the Cerebras machines. That's like saying "the DVD producer optimized this movie for you" (they just cropped and compressed it down to 4.7 GB so it would fit on a DVD). Maybe the typical mini model is 100 GB, and they made it 99 GB instead, or something like that. It's still analogous to gpt-5.3-codex-mini.
I'm underselling the work a little bit, because it takes more effort than that to get models running on Cerebras hardware (the chips are so weird and un-GPU-like), but honestly, if Cerebras can get Llama 3.1 405b or GLM 4.7 running on their own silicon, it's not that much harder for them to get gpt-5.3-codex-mini running.
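To make the DVD analogy concrete, flip the calculation around: fix the hardware budget first and solve for the largest model that fits. Again, just my own arithmetic sketch, not anything OpenAI has disclosed:

```python
# Largest model whose weights fit in SRAM for a given wafer budget.
# Illustrative only; KV cache and activations lower the practical ceiling.
SRAM_PER_WSE3_GB = 44  # on-chip SRAM per WSE-3 wafer

def max_params_billions(n_wafers: int, bytes_per_param: float) -> float:
    return n_wafers * SRAM_PER_WSE3_GB / bytes_per_param

print(max_params_billions(4, 2))  # 88.0 -> ~88B params at FP16 on 4 wafers
print(max_params_billions(1, 1))  # 44.0 -> ~44B params at FP8 on 1 wafer
```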
Uh, the combined offering (smaller model + ~800 tps on Cerebras) is nothing like the previous mini offerings, and you're hallucinating details about their process of creating it.
Read more about how Cerebras hardware handles clustering. The limit is not 44 GB or 500 GB: each CS-3 has 1,200 TB of MemoryX, supporting models of up to ~24T parameters, and up to 2,048 systems can be clustered.
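Sanity-checking those figures with back-of-envelope arithmetic (the 16 bytes-per-parameter training-state estimate is my assumption, not a published Cerebras breakdown):

```python
# MemoryX holds weights off-wafer, so capacity isn't bounded by SRAM.
# Assume FP32 weights + FP32 grads + two Adam moments = 16 B/param of
# training state; extra room is still needed for checkpoints, etc.
MEMORYX_TB = 1200              # MemoryX capacity per CS-3
BYTES_PER_PARAM_TRAINING = 16  # 4 (weights) + 4 (grads) + 8 (Adam m, v)

max_params_trillions = MEMORYX_TB * 1e12 / BYTES_PER_PARAM_TRAINING / 1e12
print(max_params_trillions)  # 75.0 -> comfortably above the ~24T claim
```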