Comment by sfink
3 days ago
I think you're missing the point. The comparison is not between 93% and 92%. The comparison is between what they're getting (93%) and what you'd get if you scaled up the usual process to the die size they're using (0%). They are doing something different (namely: a ~whole-wafer chip) that isn't possible without massively boosting the intra-chip redundancy. (The usual process stops working once you no longer have any extra dies to discard.)
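Rough back-of-the-envelope for where that 0% comes from (my sketch, not the article's, using the ~46 expected defects the article assigns to the WSE-3's area, quoted further down this thread):

```python
import math

# Rough Poisson yield sketch, NOT from the article: assume ~46 random defects
# are expected across a 46,225 mm^2 wafer-scale die (the defect figure quoted
# further down this thread). Without redundancy, one defect scraps the die.
expected_defects = 46

p_defect_free_die = math.exp(-expected_defects)  # Poisson P(k = 0)
print(f"P(defect-free wafer-scale die) ~ {p_defect_free_die:.1e}")  # ~1e-20, i.e. effectively 0% yield

# Normal GPUs dodge this by dicing the wafer and discarding or binning the bad
# dies; a wafer-scale part has nothing to discard, so it has to route around
# defective cores instead.
```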
> Despite having built the world’s largest chip, we enable 93% of our silicon area, which is higher than the leading GPU today.
The important part is building the largest chip. The icing on the cake is that the enablement is not lower, which it would be without the routing-to-spare-cores magic sauce.
And the differing terminology is because they're talking about different things? You could call an SM a core, but it kind of contains (heterogeneous) cores itself. (I've no idea whether intra-SM cores can be redundant to boost yield.) A die is the part you break off and build a computer out of; it may contain a bunch of cores. A wafer can be broken up into multiple dies, but for Cerebras it isn't.
If NVIDIA were to go and build a whole-wafer die, they'd do something similar. But Cerebras did it and got it to work. NVIDIA hasn't gotten into that space yet, so there's no point in building a product that you can't sell to a consumer, or even to a data center that isn't built around that exact product (or to contain a Balrog).
I think I'll still stand by my viewpoint. They said:
> On the Cerebras side, the effective die size is a bit smaller at 46,225mm2. Applying the same defect rate, the WSE-3 would see 46 defects. Each core is 0.05mm2. This means 2.2mm2 in total would be lost to defects.
So ok, they claim they should see (46225-2.2)/46225 = 99.995% usable area. Doing the same math with their Nvidia numbers gives about 99.4%. And yet in practice neither approach got anywhere near those numbers. I just feel like the whole article walks through all this theory and numbers and math about how they're so much better, but in practice it's meaningless.
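Spelling that arithmetic out (the Cerebras inputs are from the quote above; the Nvidia-side inputs aren't quoted here, so the 99.4% is just carried over from the article):

```python
# Theoretical usable-area fraction implied by the article's defect math.
wse3_area_mm2 = 46_225   # effective WSE-3 die area, from the quote above
area_lost_mm2 = 2.2      # 46 defects x 0.05 mm^2 per core, per the article

wse3_usable = (wse3_area_mm2 - area_lost_mm2) / wse3_area_mm2
print(f"WSE-3 theoretical usable area: {wse3_usable:.3%}")  # ~99.995%

# The article's equivalent figure for the Nvidia side works out to ~99.4%
# (those inputs aren't quoted in this thread). Compare both with what actually
# ships: 93% and 92% enabled silicon respectively.
```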
So what I'm not seeing is why it'd be impossible to interconnect all the H100s on a wafer and call it a day. You'd presumably get 92/93 = 98.9% of the performance and, here's the kicker, no need to switch to another architecture. I don't know where your 0% number comes from. Nothing in this article says that a competitor doing the same scaling to wafer scale would get 0%, just a marginal decrease in how many cores make it through fab.
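Putting numbers on that "call it a day" scenario, treating enabled silicon area as a rough proxy for performance (a big simplification, but it's the comparison the article invites):

```python
# Hypothetical wafer of stock H100s vs the WSE-3, using the article's
# enablement figures and assuming performance scales with enabled area.
cerebras_enabled = 0.93  # WSE-3 enabled silicon, per the article
nvidia_enabled = 0.92    # leading GPU enabled silicon, per the article

relative = nvidia_enabled / cerebras_enabled
print(f"Stock-architecture wafer vs WSE-3, by enabled area: {relative:.1%}")  # ~98.9%
```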
Fundamentally I am not convinced from this article that Cerebras has done something in their design that makes this possible. All I'm seeing is that it'd perform 1% faster.
Edit: thinking a bit more on it, to me it's as if they said TSMC has a guy with a sledgehammer who smashes all the wafers, and their architecture snaps a tiny bit cleaner, but they haven't said anything about firing the guy with the sledgehammer. Their paragraph before the final table says this whole exercise is pretty much meaningless, because the numbers they use for competitors are made up and aren't even the right numbers to be using. Then the table backs up my paraphrase.
There is nothing inherently good about wafer scale. It's actually harder to dissipate heat and to enable hybrid bonding with DRAM at that scale. So the gp is entirely correct that you need to actually show higher silicon utilization to even be considered worthwhile.