Comment by enragedcacti
2 days ago
Any thoughts on why they are disabling so many cores in their current product? I did some quick noodling based on the 46/970000 number, and the only way I ended up close to 900,000 was by assuming that an entire row or column would be disabled if any core within it was faulty. But doing that gave me only a ~6% yield, since most trials had active core counts in the high 800,000s.
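A minimal Monte Carlo sketch of that kind of experiment. It assumes a square grid of 985 x 985 cores (about 970,000), 46 faulty cores placed uniformly at random, and a policy where each fault disables both its whole row and its whole column. The grid shape and the row-and-column policy are my assumptions, not anything Cerebras has published, so the yield it prints is not guaranteed to match the ~6% figure above.

```python
import random


def simulate_active_cores(side: int, defects: int, rng: random.Random) -> int:
    """Place `defects` faulty cores on a hypothetical side x side grid and
    disable every row AND every column containing a fault. Returns the
    number of cores left active."""
    # Sample distinct positions so two defects never hit the same core.
    cells = rng.sample(range(side * side), defects)
    bad_rows = {c // side for c in cells}
    bad_cols = {c % side for c in cells}
    return (side - len(bad_rows)) * (side - len(bad_cols))


def estimate_yield(side: int, defects: int, target: int, trials: int, seed: int = 0) -> float:
    """Fraction of simulated wafers that keep at least `target` active cores."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    hits = sum(
        simulate_active_cores(side, defects, rng) >= target for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    side = 985  # assumption: square grid, ~sqrt(970_000)
    print(estimate_yield(side, defects=46, target=900_000, trials=2_000))
```

Swapping the policy (e.g. disabling only rows) is a one-line change in `simulate_active_cores`, which makes it easy to see how sensitive the yield is to that assumption.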
My guess is that it helps with heat dissipation/management, but I don't know. That guess comes from looking at the list of patents they hold [1].
[1] https://patents.justia.com/assignee/cerebras-systems-inc
They did mention that they stash extra cores to enable the rerouting. Those extra cores are presumably unused when they are not routed in.
That was my first thought, but based on the rerouting graphic, it seems like the extra cores would be one or two rows and columns around the border, which would only account for ~4,000 cores.
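Rough arithmetic for that border estimate. It assumes a square 985 x 985 grid and treats 970,000 physical vs. 900,000 active cores as the gap the spares would need to explain; both are my assumptions. Reserving k full rows plus k full columns gives 2·k·side − k² spare cores (the k x k corner overlap is counted once).

```python
def border_spares(side: int, k: int) -> int:
    """Cores in k full rows plus k full columns of a side x side grid."""
    return 2 * k * side - k * k


def rows_and_cols_needed(side: int, gap: int) -> int:
    """Smallest k whose k spare rows + k spare columns cover `gap` cores."""
    if gap > side * side:
        raise ValueError("gap exceeds the whole grid")
    k = 0
    while border_spares(side, k) < gap:
        k += 1
    return k


side = 985               # assumption: square grid, ~sqrt(970_000)
gap = 970_000 - 900_000  # assumption: physical minus active cores
print(border_spares(side, 1), border_spares(side, 2), rows_and_cols_needed(side, gap))
# 1969 3936 37
```

So two border rows and columns land near ~4,000 cores, while covering the full ~70,000 gap this way would take roughly 37 of each.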
If the system were broken down internally into more subdivisions, there would be more cores dedicated to replacement. It also seems like it could be harder to reroute around an entire row or column of cores on a wafer than around a small block. Perhaps they are also building in heavy redundancy as a proof of concept (POC) and will later optimize the number of cores they expect to lose.
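A toy comparison of the subdivision idea. It assumes a hypothetical scheme where a 1,000 x 1,000 grid is cut into square tiles and every tile keeps one spare row and one spare column. Nothing here reflects Cerebras' actual layout; it only shows how total spares grow as tiles shrink (roughly 2·side²/block).

```python
def block_spares(side: int, block: int) -> int:
    """Total spare cores if a side x side grid is split into block x block
    tiles and each tile reserves one spare row and one spare column
    (hypothetical sparing scheme)."""
    if side % block:
        raise ValueError("side must be a multiple of block")
    tiles = (side // block) ** 2
    return tiles * (2 * block - 1)


for block in (1000, 250, 100, 40, 25):
    print(block, block_spares(1000, block))
# 1000 1999
# 250 7984
# 100 19900
# 40 49375
# 25 78400
```

Under this toy model, tiles somewhere around 25 to 40 cores wide would reserve tens of thousands of cores, the same order of magnitude as the gap being discussed.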