Comment by ryao
3 days ago
> Take the Nvidia H100 – a massive GPU weighing in at 814 mm². Traditionally this chip would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs but the commercialized product only has 132 SMs active. This means the chip could suffer numerous defects across 12 SMs and still be sold as a flagship part.
Fault tolerance seems to be the wrong term to use here. If I wrote this, I would have written redundant.
Redundant cores lead to a fault tolerant chip.
ECC memory is fault tolerant. It repairs issues on the fly without disabling hardware. This, on the other hand, is merely redundancy to handle manufacturing defects. If they make a mistake and ship a bad core that malfunctions at runtime, it is not going to tolerate that.
Redundancy is a method of providing fault tolerance; the existence of other methods doesn't make it less fault tolerant.
Nothing is tolerant to all possible faults. Fault tolerance refers to being able to tolerate specific types of faults under specific conditions.
Fault tolerant is the proper term for this.
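To put rough numbers on the quoted H100 example (144 physical SMs, 132 enabled, so up to 12 may be defective), here is a minimal sketch of the yield effect. It assumes each SM is defective independently with the same probability and ignores defects outside the SMs; the function name `sellable_fraction` and the 3% defect rate are made up for illustration and are not real process data.

```python
from math import comb


def sellable_fraction(total_units: int, spare_units: int, p_defect: float) -> float:
    """Probability that at most `spare_units` of `total_units` are defective.

    Assumes independent, identically likely defects per unit (a simple
    binomial model), which is an illustrative simplification.
    """
    return sum(
        comb(total_units, k) * p_defect**k * (1 - p_defect) ** (total_units - k)
        for k in range(spare_units + 1)
    )


P_DEFECT = 0.03  # hypothetical per-SM defect probability, not a measured value

no_spares = sellable_fraction(144, 0, P_DEFECT)     # all 144 SMs must work
with_spares = sellable_fraction(144, 12, P_DEFECT)  # any 132 of 144 suffice

print(f"sellable with 0 spare SMs:  {no_spares:.4f}")
print(f"sellable with 12 spare SMs: {with_spares:.4f}")
```

Under these assumptions the fraction of sellable dies rises sharply once spare SMs are allowed. Note that this only models defects found at manufacturing test, which is the distinction being argued above: it says nothing about faults that appear at runtime.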