Comment by agnosticmantis
25 days ago
> I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation.
What's the historical context for this MTBF vs. MTTR reckoning?
If you optimize for MTBF, you optimize for it to be a long time between failures. You optimize for the system not going down in the first place, but when it does do down it might be Pretty Bad.
If you optimize for MTTR, you don't care how often you go down and instead optimize your recovery time to be as short as possible.
The concepts are pre-computing.
Not the GP commenter, but I'm still struggling to understand how this relates to the AI world, or perhaps more importantly, what the historical context was. Did people end up switching to MTTR optimization over MTBF optimization? If so, is the implication that the recovery times got lower but software instability went up as a result?
There are concerns that AI might/will make mistakes. Instead of optimizing for producing perfect code, they think that AI can fix bugs as fast as it produces code and are optimizing for MTTR. Sounds like decision made by people who don't write code regularly, as there is this Architectural drift that happens where you are no longer aware of what's happening in your codebase. As a junior guy I so want this to happen.
MTBF = optimizing quality (reliability, uptime, correctness) of AI product
MTTR = optimize the ability to correct failures when they occur.
He's describing leaders who believe quality no longer matters because any faults or deviations can be corrected so quickly that it doesn't make any sense to waste time on quality.
1 reply →
To give a timely example, think GitHub and what its leadership is thinking/optimizing for. Do you care if you’re down once or twice a week vs how long those down times are? What’s the KPI you’re managing GitHub with?
Current (and by current I mean the last 4-5 years) they only cared about MTTR. That was probably the only metric they measured and cared about. When a system went down it fired an LSI “Live Site Incident” (as opposed to a CRI “Customer Reported Incident”). At the time you grilled your team. Eventually you come to the conclusion that an LSI should only be measured by MTTR. MTBF is meaningless because MTBF limits your “ship new features” velocity.
You might scoff at GitHub and “ship a new feature” concept in the last 5 years, but if you’re an enterprise customer you’d know how much nonesense they shoveled out in the last 5 years. Absolute insanity of “what the fuck” type feature because customer X who is paying $$$ is asking for it type features.
Same grifters optimizing for MTTR are now pushing even more reckless use of AI, because “accidents will happen anyway, so we need to prioritize speed”.
Before the cloud, people were trying to reduce the mean time between failure (MTBF) essentially trying to prevent a thing from failing. With cloud, people are trying to recover as quickly as possible (mean time to recovery) accepting that things will fail —- it’s about how fast you can react to it.
John Allspaw (previously CTO at Etsy) has written about this: https://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-ty...