Comment by MobiusHorizons

9 hours ago

The bathtub curve isn’t the same for all components of a server though. Writing off the entire server because a single RAM chip or SSD or network card failed would limit the entire server to the lifetime of the weakest part. I think you would want redundant hot spares of certain components with lower mean time between failures.

We do often write off an entire server because a single component fails. The lifetime of the shortest-lived components is usually long enough that, even on Earth with easy access, it's often not worth the cost of attempting a repair. In an easy-to-access data centre, the components most likely to get replaced are hot-swappable drives or power supplies, but it's been about two decades since I last worked anywhere where anyone bothered to check for failed RAM or failed CPUs to salvage a server. And a lot of servers don't have network devices you can replace without soldering, and haven't for a long time outside of really high-end networking.

And at sufficient scale, once you plan for that, you can massively simplify the servers. A server case suitable for hot-swapping drives adds a massive amount of waste if you're not actually going to use the capability.