Comment by fulafel

8 days ago

Designing to tolerate the defects is well-trodden territory. You just expect some rate of defects and have a way of disabling failing blocks.

So you shoot for 10% more cores and disable failing cores?

  • More or less, yes. Of course, defects are not evenly distributed, so you get a lot of chips with different grades of brokenness. Normally the more broken chips get sold off as lower-tier products. A six core CPU is probably an eight core with two broken cores.

    Though in this case, it seems [1] that Cerebras just has so many small cores that they can expect a fairly consistent level of broken cores and route around them.

    [1]: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...

  • Well, it's more like they have 900,000 cores on a WSE and disable whichever ones don't work.

    Seriously, that's literally just what they do.

    • Their blog post, linked in the sibling comment, says the raw number is 970k and they enable 900k (table at the end).
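
For anyone who wants the arithmetic spelled out, here is a rough Python sketch of the two ideas in this thread: binning a chip into a lower tier by its count of working cores, and the spare-core margin implied by the 970k raw / 900k enabled figures quoted above. The function names and the 8/6 tier list are made up for illustration; this is not how any vendor actually does it.

```python
from typing import Optional, Sequence


def bin_chip(working_cores: int, tiers: Sequence[int]) -> Optional[int]:
    """Return the largest tier (in cores) this chip can be sold as.

    `tiers` is a hypothetical product lineup, e.g. (8, 6).
    Returns None if the chip has too few working cores for any tier.
    """
    for tier in sorted(tiers, reverse=True):
        if working_cores >= tier:
            return tier
    return None


def spare_fraction(raw_cores: int, enabled_cores: int) -> float:
    """Fraction of the raw cores held back as spares."""
    return (raw_cores - enabled_cores) / raw_cores


def is_usable(raw_cores: int, enabled_cores: int, defective: int) -> bool:
    """A part still meets spec if the defects fit within the spares."""
    return defective <= raw_cores - enabled_cores


# An eight core die with two broken cores ships as a six core part.
print(bin_chip(6, (8, 6)))

# Spare margin from the figures quoted in the thread: about 7.2%.
print(f"{spare_fraction(970_000, 900_000):.1%}")
```

So the margin is closer to 7% of the raw cores than the 10% guessed upthread, assuming those two numbers are the whole story.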