Comment by fulafel
8 days ago
Designing to tolerate the defects is well-trodden territory. You just expect some rate of defects and have a way of disabling failing blocks.
So you shoot for 10% more cores and disable failing cores?
More or less, yes. Of course, defects are not evenly distributed, so you get a lot of chips with different grades of brokenness. Normally the more broken chips get sold off as lower-tier products. A six-core CPU is probably an eight-core with two broken cores.
Though in this case, it seems [1] that Cerebras just has so many small cores that they can expect a fairly consistent level of broken cores and route around them.
[1]: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...
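To make the binning idea above concrete, here is a rough Python sketch. It assumes every core fails independently with the same probability, which is a simplification (as noted, real defects are not evenly distributed). The 5% defect rate and the tier cutoffs are made up for illustration, not taken from any vendor.

```python
from math import comb


def working_core_distribution(total_cores: int, p_defect: float) -> list[float]:
    """Return P(exactly k cores work) for k = 0..total_cores.

    Assumes each core fails independently with probability p_defect,
    which is a simplification: real defects tend to cluster.
    """
    p_ok = 1.0 - p_defect
    return [
        comb(total_cores, k) * p_ok**k * p_defect ** (total_cores - k)
        for k in range(total_cores + 1)
    ]


def tier(working_cores: int) -> str:
    """Hypothetical product tiers for an eight-core die."""
    if working_cores >= 8:
        return "8-core"
    if working_cores >= 6:
        return "6-core"  # sold with the broken (or surplus) cores disabled
    return "scrap"


def tier_shares(total_cores: int, p_defect: float) -> dict[str, float]:
    """Sum the probability mass that lands in each hypothetical tier."""
    shares: dict[str, float] = {}
    for k, prob in enumerate(working_core_distribution(total_cores, p_defect)):
        shares[tier(k)] = shares.get(tier(k), 0.0) + prob
    return shares


if __name__ == "__main__":
    # Illustrative 5% per-core defect rate; not a real process figure.
    for name, share in tier_shares(8, 0.05).items():
        print(f"{name}: {share:.1%}")
```

With only eight large cores, a noticeable share of dies ends up in a lower tier. With hundreds of thousands of tiny cores, the fraction of broken cores per wafer varies much less, so a fixed pool of spares is enough.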
Well, it's more like they have 900,000 cores on a WSE and disable whichever ones don't work.
Seriously, that's literally just what they do.
In their blog post linked in the sibling comment it says the raw number is 970k and they enable 900k (table at the end).
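For comparison with the 10% guess upthread, a quick back-of-the-envelope calculation in Python, using only the rounded 970k and 900k figures quoted here (the exact counts may differ):

```python
# Rounded figures as quoted in the thread; exact counts may differ.
RAW_CORES = 970_000
ENABLED_CORES = 900_000

spare_cores = RAW_CORES - ENABLED_CORES

# Share of the physical cores that can be lost to defects.
spare_fraction_of_raw = spare_cores / RAW_CORES

# Extra cores built relative to the number that ships enabled.
overprovision = spare_cores / ENABLED_CORES

if __name__ == "__main__":
    print(f"spare cores: {spare_cores}")
    print(f"spare share of raw cores: {spare_fraction_of_raw:.1%}")
    print(f"over-provisioning vs enabled: {overprovision:.1%}")
```

So roughly 7% of the raw cores are spares, or a bit under 8% more cores than the enabled count, somewhat below the 10% guessed above.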