
Comment by 0xbadcafebee

7 days ago

There's a fallacy often repeated for computers: "It's lasted a long time, so it's going to keep lasting a long time." The thing is, computer hardware failure is often due to manufacturing flaws. There are many components that could have flaws, and they're subject to varying environmental stresses (both at build time and at run time), so there are many failure modes.

It's difficult to know exactly when a server might fail. It might be within a month of its build; it might be 50 years. But what's clear is that failure isn't less likely as the machine gets older, it's more likely. There are outliers, but they're rare. The failure modes for these things are well documented, and the whole thing is designed to fail within a certain number of hours (if it's not the hard drive, it's the fan, the CPU, the memory, the capacitors, the solder joints, etc.). It doesn't get better as it ages.

But environmental stress is often a predictor of how long it lives. If the machine is cooled properly, kept in a low-humidity environment, jostled less, and run at low capacity (fans not spinning as hard, temperatures not as high, disks not written to as much, etc.), then it tends to live longer. So you can decrease the probability of failure, and it may live longer. But it might also drop dead tomorrow, because, again, there may be manufacturing flaws.

If given the choice, I wouldn't buy an old machine, because I don't know what kind of stress it's had, and the math is stacked against it.
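
As a back-of-the-envelope illustration of why the odds get worse with age, a Weibull hazard with shape > 1 models that wear-out behavior (the shape and scale numbers below are invented, not fitted to any real fleet):

    # Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1).
    # Shape k > 1 means the instantaneous failure rate climbs with age
    # (wear-out); k < 1 would mean infant mortality dominates.
    # Parameters are illustrative only, not measured from real hardware.

    def weibull_hazard(t_hours, shape=1.5, scale=50_000.0):
        """Instantaneous failure rate at age t_hours (failures per hour)."""
        return (shape / scale) * (t_hours / scale) ** (shape - 1)

    for years in (1, 3, 5, 10):
        t = years * 8760  # rough power-on hours
        print(f"{years:>2} yr: hazard ~ {weibull_hazard(t):.2e} failures/hour")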

> But what's clear is that failure isn't less likely as the machine gets older, it's more likely.

Is this true? Doesn't most hardware have a dip in failure rate in the middle of its average lifespan?

  • It depends on the components. The bathtub curve applies to most manufactured equipment in some way, but specific kinds of hardware are more prone to it than others. Hard drives, fans, power supplies, dedicated controllers, RAM, and CPU modules all fail at different rates. Combine that with the varying failure rates of different grades of components, manufacturer/model differences, environmental differences, and load differences, and it's all over the map. But in general, a failure of any one of these components is effectively a system failure (see the sketch at the end of this comment), so there is always some varying degree of failure risk over time due to the fluctuation of all these variables.

    I also believe there's a psychic component to failures. The machines know when you're close to product launch, or when someone has just discovered the servers haven't been maintained in a while and are at risk of failing. Then they'll fail for sure. Especially if there are hot-spare or backup servers, which will conveniently fail as well.
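
    On the "any one component is effectively a system failure" point above, here's a quick sketch of how per-component odds compound into a whole-box number (the per-year probabilities are invented for illustration, not measured):

        # Series system: the box is down if ANY component fails.
        # P(system survives) = product over components of (1 - p_i),
        # so P(system fails)  = 1 - that product.
        # Per-year failure probabilities below are made up for illustration.

        component_fail_prob = {
            "hard drive": 0.05,
            "fan": 0.03,
            "power supply": 0.02,
            "RAM": 0.01,
            "CPU": 0.005,
        }

        survive = 1.0
        for part, p in component_fail_prob.items():
            survive *= (1.0 - p)

        print(f"Any-component failure probability per year: {1.0 - survive:.1%}")
        # ~11%, even though no single part is above 5%.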