Comment by tristanj
8 hours ago
They tried and failed. xAI made a mistake building Colossus 1 and ended up with a heterogeneous cluster of H100/H200/GB200 GPUs. This is a nightmare to train huge models on because each card has different specs, features, and hardware requirements. During gradient synchronization, a heterogeneous cluster bottlenecks on the slowest GPU (H100), so the faster GPUs end up idling. They also probably ran into unexpected compatibility issues, which are difficult to resolve.
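To put rough numbers on the idling, here is a minimal sketch assuming plain synchronous data parallelism with equal shards per GPU. The relative throughput figures are made up for illustration (not measured specs), and `idle_fractions` is a hypothetical helper name.

```python
# Hypothetical relative throughputs in samples/sec. NOT real benchmark numbers.
throughput = {"H100": 1.0, "H200": 1.5, "GB200": 2.5}

def idle_fractions(throughput, samples_per_gpu):
    """Fraction of each step a GPU spends waiting when every GPU gets the same shard.

    Each GPU's compute time is samples / throughput, and the synchronous
    step cannot finish before the slowest GPU does.
    """
    compute_time = {gpu: samples_per_gpu / rate for gpu, rate in throughput.items()}
    step_time = max(compute_time.values())
    return {gpu: 1.0 - t / step_time for gpu, t in compute_time.items()}

idle = idle_fractions(throughput, samples_per_gpu=64)
for gpu, frac in idle.items():
    print(f"{gpu}: idle {frac:.0%} of each step")
```

Under these assumed numbers the slowest card never waits, while the fastest one sits idle for more than half of every step.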
It makes more sense to use this cluster for inference, since they can segment the cluster by GPU type and avoid mixing GPUs. xAI doesn't have enough inference customers, so it makes sense to monetize the cluster by renting it to companies that need inference compute, such as Anthropic or Cursor.
Apparently xAI will try building SOTA models on Colossus 2, which will be built on Blackwell GPUs only.
How can something so obvious be overlooked by the team building the data centre? Can't the sharding be uneven so that weaker GPUs still finish fast by taking on a smaller workload?
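For what it's worth, the uneven split being asked about could look something like this in the simplest data-parallel case. This is only a sketch: the throughput numbers are invented, `proportional_shards` is a hypothetical helper, and it ignores memory limits, interconnect differences, and model/tensor parallelism, which are likely where the real difficulty lies.

```python
# Hypothetical relative throughputs in samples/sec. NOT real benchmark numbers.
throughput = {"H100": 1.0, "H200": 1.5, "GB200": 2.5}

def proportional_shards(throughput, global_batch):
    """Split global_batch across GPUs in proportion to throughput.

    Uses largest-remainder rounding so the shard sizes are integers
    that still sum exactly to global_batch.
    """
    total = sum(throughput.values())
    exact = {gpu: global_batch * rate / total for gpu, rate in throughput.items()}
    shards = {gpu: int(x) for gpu, x in exact.items()}
    leftover = global_batch - sum(shards.values())
    # Hand the remaining samples to the GPUs with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda gpu: exact[gpu] - shards[gpu], reverse=True)
    for gpu in by_remainder[:leftover]:
        shards[gpu] += 1
    return shards

shards = proportional_shards(throughput, global_batch=100)
step_times = {gpu: shards[gpu] / throughput[gpu] for gpu in shards}
print(shards, step_times)
```

With these assumed rates every GPU finishes its shard at the same time. One catch even in this toy version: if each GPU averages gradients over its own shard, the all-reduce has to weight each contribution by shard size, otherwise samples on the slower GPUs count for more than the rest.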
It's not like they had much of an option when everybody was hoarding every GPU they could. For the second Colossus they could book future production, but the first one had to be built ASAP so that xAI looked like a serious competitor in the AI space.
I imagine it involved a petulant billionaire screaming "Fucking build it. Build it NOW!" in response to expert feedback.