Comment by sebmellen
25 days ago
Not the GP commenter, but I'm still struggling to understand how this relates to the AI world, or perhaps more importantly, what the historical context was. Did people end up switching to MTTR optimization over MTBF optimization? If so, is the implication that the recovery times got lower but software instability went up as a result?
There are concerns that AI might/will make mistakes. Instead of optimizing for producing perfect code, they think that AI can fix bugs as fast as it produces code, so they are optimizing for MTTR. Sounds like a decision made by people who don't write code regularly, because architectural drift sets in and you are no longer aware of what's happening in your codebase. As a junior guy I so want this to happen.
MTBF (mean time between failures) = optimizing the quality (reliability, uptime, correctness) of the AI product.
MTTR (mean time to recovery) = optimizing the ability to correct failures when they occur.
He's describing leaders who believe quality no longer matters because any faults or deviations can be corrected so quickly that it doesn't make any sense to waste time on quality.
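To make the two metrics concrete, here is a minimal sketch of how they might be computed from an incident log. The `Incident` record and the sample timestamps are made up for illustration, and MTBF is taken here as the mean uptime between the end of one incident and the start of the next, which is only one of the conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the outage began and when service was restored.
    start: datetime
    end: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: the average outage duration."""
    total = sum((i.end - i.start for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean time between failures: average uptime between consecutive incidents."""
    ordered = sorted(incidents, key=lambda i: i.start)
    gaps = [b.start - a.end for a, b in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Illustrative data only: three one-hour outages, each 48 hours after the last recovery.
sample = [
    Incident(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 1, 0)),
    Incident(datetime(2024, 1, 3, 1, 0), datetime(2024, 1, 3, 2, 0)),
    Incident(datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 3, 0)),
]

print(mttr(sample), mtbf(sample), availability(sample_mtbf := mtbf(sample), mttr(sample)))
```

Note that availability improves whether you raise MTBF or lower MTTR, which is why a team can look good on paper while failing more often.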
Yes, that’s very correct. The way I think of it, MTTR is easier to measure and manage as a manager. MTTR is all about “operational excellence”. Basically, when shit hits the fan, how good are we at figuring out what caused it and how to fix it. That’s a muscle you can train. The script goes:
- What alerts are we missing that could have helped us catch that earlier?
- What dashboards could we have had to help diagnose the issue quicker?
- What Ops tools could we have had to help mitigate such an issue quicker?
- What extra logging/metrics/telemetry could we add to help us catch this quicker?
- What “safe deployment practices” could we have employed to avoid/improve this?
- What processes could we enforce to facilitate all of that?
Rinse and repeat that a few hundred or a few thousand times while monitoring the MTTR KPI and you will see that number improve. Most likely through your team “gaming it”.
MTBF is much, much trickier to measure or “manage out”. It’s about “excellence in engineering”, which is neither measurable nor controllable. You want a random feature X. Your team tells you it’s really not how the system works, and they want a few months to make the change slowly while observing the system. But you don’t want just X, you want X, Y, Z, W, V, Q, A, B, C, D, all the way through AAZZW12. So you tell the team to go fuck itself.
To give a timely example, think GitHub and what its leadership is thinking/optimizing for. Do you care more about whether you’re down once or twice a week, or about how long those downtimes last? What’s the KPI you’re managing GitHub with?
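The frequency-versus-duration question has a simple arithmetic core: expected downtime is roughly the number of failures times the average recovery time, so very different failure profiles can report the same total. A rough sketch with made-up numbers (the function name and the figures are illustrative, not GitHub data):

```python
def weekly_downtime_minutes(failures_per_week: float, mttr_minutes: float) -> float:
    """Expected downtime per week = failure rate x mean time to recovery."""
    return failures_per_week * mttr_minutes


# Hypothetical profile A: down twice a week, recovered in 15 minutes each time.
frequent_short = weekly_downtime_minutes(failures_per_week=2, mttr_minutes=15)

# Hypothetical profile B: down once every four weeks, but it takes two hours to recover.
rare_long = weekly_downtime_minutes(failures_per_week=0.25, mttr_minutes=120)

print(frequent_short, rare_long)
```

Both profiles come out to the same weekly downtime, yet the first one interrupts users eight times as often. A KPI built only on MTTR rewards profile A and never sees that difference.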
Currently (and by currently I mean the last 4-5 years) they have only cared about MTTR. That was probably the only metric they measured and cared about. When a system went down it fired an LSI, a “Live Site Incident” (as opposed to a CRI, a “Customer Reported Incident”). At that point you grilled your team. Eventually you come to the conclusion that an LSI should only be measured by MTTR. MTBF is meaningless because optimizing for MTBF limits your “ship new features” velocity.
You might scoff at GitHub and the “ship a new feature” concept in the last 5 years, but if you’re an enterprise customer you’d know how much nonsense they shoveled out in that time. Absolute insanity: “what the fuck” type features that exist because customer X, who is paying $$$, asked for them.
Same grifters optimizing for MTTR are now pushing even more reckless use of AI, because “accidents will happen anyway, so we need to prioritize speed”.