Comment by pantalaimon
13 hours ago
They don't need to be space grade, consumer hardware will do just fine.
For AI a random bit flip doesn't matter much.
Only if that bitflip happens somewhere in your actual data, vs. some GPU pipeline register that then locks up the entire system until a power cycle. Or causes a wrong address to be fetched. Or causes other nasty silent errors. Or...
Try doing fault injection on a chip some time. You'll see it's significantly easier to cause a crash / reset / hang than to just flip data bits.
'rad-triggered bit flips don't matter with AI' is a lie spoken by people who have obviously never done any digital design in their life.
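To make the data-vs-control distinction concrete, here is a rough toy sketch in Python. It is not real fault injection on hardware: the `weights` list, the `fetch` helper and the choice of which bits to flip are all made up for illustration. It only shows that the same single-bit flip is negligible in a float32 mantissa but sends an index far out of range.

```python
import struct

def flip_bit_f32(x: float, bit: int) -> float:
    """Flip one bit of the float32 encoding of x (bit 0 = mantissa LSB, bit 31 = sign)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out

def flip_bit_int(n: int, bit: int) -> int:
    """Flip one bit of a non-negative integer, e.g. an address or index."""
    return n ^ (1 << bit)

# Hypothetical "model data": a few float32-representable weights.
weights = [0.5, -0.25, 0.125, 1.0]

def fetch(i: int):
    """Hypothetical memory fetch; None stands in for a fault, hang or reset."""
    try:
        return weights[i]
    except IndexError:
        return None

# Case 1: the flip lands in the data (lowest mantissa bit of a weight).
noisy = flip_bit_f32(weights[0], 0)
data_error = abs(noisy - weights[0])   # tiny perturbation of the value

# Case 2: the same kind of flip lands in control state (an index).
index = 2
bad_index = flip_bit_int(index, 20)    # now points far outside the array

print("data error:", data_error)
print("fetch(index):", fetch(index))
print("fetch(bad_index):", fetch(bad_index))
```

In the toy, the corrupted weight is off by about 6e-8, while the corrupted index no longer refers to anything valid. On a real chip the second case is where the wrong-address fetches and lockups come from.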
As long as they stay below the Van Allen belts and deal with the weaker magnetic shielding in sun-synchronous orbit (high latitudes).
I would say they probably need something a little beefier than consumer hardware and then just deal with lots of failures and bit flips.
But cooling is a bigger issue probably?
Random bit flips might even improve output.
Single upset events in a modern GPU are not bitflips. They destroy the surrounding circuitry and usually disable the whole unit.
If that happens, you disable that CUDA core. If your GPU is too damaged, you deorbit the satellite.