Comment by agwa

5 years ago

> a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust

Is that normal at Google? Making people feel bad for an outage doesn't seem consistent with the "blameless postmortem" culture promoted in the SRE book[1].

[1] https://sre.google/sre-book/postmortem-culture/

"Blameless Postmortem" does not mean "No Consequences", even if people often want to interpret it that way. If an organization determines that a disconnect between ground work and a customer's experience is a contributing factor to poor decision making then they might conclude that making engineers more emotionally invested in their customers could be a viable path forward.

  • Relentless customer service is never going to screw you over in my experience... It pains me that we have to constantly play these games of abstraction between engineer and customer. You are presumably working a job which involves some business and some customer. It is not a fucking daycare. If any of my customers are pissed about their experience, I want to be on the phone with them as soon as humanly possible and I want to hear it myself. Yes, it is a dreadful experience to get bitched at, but it also sharpens your focus like you wouldn't believe when you can't just throw a problem to the guy behind you.

    By all means, put the support/enhancement requests through a separate channel+buffer so everyone can actually get work done during the day. But, at no point should an engineer ever be allowed to feel like they don't have to answer to some customer. If you are terrified a junior dev is going to say a naughty phrase to a VIP, then invent an internal customer for them to answer to, and diligently proxy the end customer's sentiment for the engineer's benefit.

• I think of this in terms of empathy: every engineer should be able to provide a quick and accurate answer to "What do our customers want? And how do they use our product?"

      I'm not talking esoterica, but at least a first approximation.


  • From the SRE book: "For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the 'wrong' thing prevails, people will not bring issues to light for fear of punishment."

    If it's really the case that engineers are lacking information about the impact that outages have on customers (which seems rather unlikely), then leadership needs to find a way to provide them with that information without reading customer emails about how the engineers "let them down", which is blameful.

    Furthermore, making engineers "emotionally invested" doesn't provide concrete guidance on how to make better decisions in the future. A blameless postmortem does, but you're less likely to get good postmortems if engineers fear shaming and punishment, of which reading those customer emails is a minor form.

    • I work at Google and have written more than a few blameless postmortems. You don't need to quote things to me.

      Is what was described above "finger pointing or shaming"? I don't work in TI so I didn't experience this meeting but it doesn't seem like it is. It also doesn't sound to me like this was the only outcome, where the execs just wagged their fingers at engineers and called it a day. Of course there'd be all sorts of process improvements derived from an understanding of the various system causes that led to an outage.


Not the original googler responding, but I have never experienced what they describe.

Postmortems are always blameless in the sense that "Somebody fat-fingered it" is not an acceptable explanation for the causes of an incident - the possibility of fat-fingering it in the first place must be identified and eliminated.

Opinions are my own, as always

  • > Not the original googler responding, but I have never experienced what they describe.

    I have also never experienced this outside of this single instance. It was bizarre, but it tried to reinforce the point that something needed to change -- it was the latest in a string of major customer-facing outages across various parts of TI, potentially pointing to cultural issues with how we build things.

    (And that's not wrong, there are plenty of internal memes about the focus on building new systems and rewarding complexity, while not emphasizing maintainability.)

    Usually mandatory trainings are things like "how to avoid being sued" or "how to avoid leaking confidential information". Not "you need to follow these rules or else all of Cloud burns down; look, we're already hemorrhaging customer goodwill."

    As I said, there was significant scar tissue associated with this event, probably caused in large part by the initial reaction by leadership.

I assume it was training for all SREs, like "this is why we're all doing so much to prevent it from recurring"

I don't think Google really cares about listening to their users. I have spent more than 6 hours trying to get simple warranty issues resolved. I wish they had to feel the pain of their actions and decisions.