Comment by ffsm8

3 hours ago

> SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.

According to the stated goals of SRE, this is actually not just a small fraction, but something that shouldn't happen at all. To be clear, I'm fully aware that this will always be necessary, but whenever it happens, it's because the site reliability engineer (SRE) overlooked something.

Hence, if that's considered a large part of the job, then you're just not an SRE as Google defined that role:

https://sre.google/sre-book/table-of-contents/

Very little connection to the blog post we're commenting on though - at least as far as I can tell.

At least I didn't find any focus on debugging. It put forward that the capability to produce reliable software is what will be the distinguishing factor in the future, and I think this holds up and is in line with the official definition of SRE.

This makes sense - as an analogy, the flight crash investigator is presumably a very different role from the engineer designing flight safety systems.

  • I think you've identified analogous functions, but I don't think your analogy holds as you've written it. A more faithful analogy to OP is that there is no better flight crash investigator than the aviation engineer designing the plane, but flight crash investigation is an actual failure of his primary duty of engineering safe planes.

    Still not a great rendition of this thought, but closer.