Comment by OminousWeapons

5 years ago

> If people are trying to assign blame to a bug or outage, it's time to move on.

This is one of my favorite excerpts. I once worked in a lab where we would have frequent catastrophic failures because there was never any disaster planning or contingency management plan. I personally triaged 3 such incidents alone or with people who happened to be there when the problem arose and attempted to disseminate some suggestions for how to prevent similar problems in the future. No one was interested. People were primarily interested in tearing my head off because I hadn't handled the problem the way they would have done it (of course, they were out drinking beers or sleeping while I was dealing with the issue at 12 AM or on a weekend).

After the third time I said fuck it, the next time there is an issue I am going to ensure my own projects are safe and then I'm going home and turning my phone off. Let someone else deal with it. That is not the culture you want to be promoting.

I was on call as a new developer on a system. I was not given any procedures or troubleshooting documents. I got a call at 1 am, missed it, and waited one minute to see if there was a message. I did see a voicemail, so I started listening and logging on. Before I could even get halfway through, the person called again (why not leave the voicemail on the final attempt?). So I'm looking for the issue/fix for 5 minutes and they tell me they know who the SME is for the functionality, so they will call them. Why even call me if you're just going to call the SME without giving me time to look at it? I got negative feedback from my manager about the way I handled it. So, I asked how I should have handled it without any training or documentation. They said I should have called the SME. Well, I didn't know who the SME was and there's no documentation or list of who the SME is for which part of the system, nor was I instructed to immediately call the SME. Again, why not just call the SME first if they knew who it was and the SME didn't create documentation because they are "too busy"?

  • How was the interview process before you got hired on? Any warning signs that seem obvious now in retrospect?

    • The hiring process for the company wasn't special. Of course half the stuff they claimed in the interview changed later (was hired as a Java dev but was assigned to Filenet, they said they don't outsource or lay off but have started doing both).

      This was an internal transfer. There were definitely warning signs in that interview. I was desperate because they were outsourcing my job in an obscure tech (Filenet) and we were expecting a kid.

      The hiring manager said something to the effect of, "I was surprised anyone internal even applied to this job".

      'Warning flag' doesn't do this justice. I have no idea what to call it, but desperation required I ignore it.

  • > SME didn't create documentation because they are "too busy".

    Because they keep getting shoulder tapped to put out fires. Because they’re the only one who knows the system. Because there is no documentation…

    • Basically. Except there were 3 other tech leads in that area. They didn't know that specific piece of functionality, but they could be given the new work to take stuff off that team's plate to make time for documentation. The leadership in that area didn't really care about anything other than delivering fast. Testing? Eh... Security issues? They're not that big of a deal - do them on an above and beyond basis (contrary to enterprise policy). On call documentation? Not even going to try to create it. I mean really, all you have to do is create a knowledge document out of the SNOW incident ticket. Then the next time it happens there will be a link to the steps taken. But no.

Quoting what an HN user said in another post a few months back: "if a person can break a system, the system was broken to start with".

  • Eh, that's a nice thing to say, but it only makes sense at certain scales, and no matter what, there's always a person that can break it.

    If any random person can break it, it's already broken.

    If any employee can break it, it's probably broken (there are very small scales where even this doesn't apply. Ever worked for a company with less than ten people? There's probably something any employee can break).

    If any employee that's an engineer, sysadmin or developer can break it, well now you're at least reducing the problem to a more specific set of people.

    If only the people on a specific team responsible for a system can affect the system, now you've reached a fairly good point, where there's separation of concerns and you're mitigating the potential problems.

    If only a single person can break the system, you've gone too far. That effectively means only a single person can fix or work on the system too, and congratulations, you've engineered yourself into a bus-factor of one. Turn right around and go back to making sure a team can work on this.

    Finally, realize that sometimes the thing only one team can break is an underlying dependency for many other things, and they may inadvertently affect those. You can't really engineer yourself out of that problem without making every team and service a silo that goes from top to bottom. Got a shared VM infrastructure, whether in-house or in the cloud? The people that administer that can cause you problems. Don't ever believe they can't. Your office IT personnel? Yep, they can cause you problems too.

    Some problems you fix by making it so they can't happen. Other problems you fix by making it hard to happen and putting provisions in place that mitigate the problems if they do.

    • There are lots of places where we require that no single person can break the system at least in a certain way.

      For example code review and LGTM ensures that a single individual can't just break the system by pushing bad code.

      Often there are other control planes that don't have the same requirement, but I think the idea that there must always be one person who can break the system isn't clearly true.

  • What systems are you working on? Many are held together by ritual, and deviating from the ritual causes outages. They’re very fragile in some form (deployment, change, infrastructure, dependencies, etc.). They won’t break if you follow the happy path, but to say they’re so robust that an active attempt at breaking won’t bring them down is ... naive? Not sure if that’s the word I’m looking for.

    I say this as someone who’s worked at large tech companies that are “internet scale”.

Maybe. I've seen the opposite, where no one takes responsibility for anything, and it's also bad. In fact, the situation you describe could also be a lack of anyone else taking responsibility for disaster planning, etc.

I think what is needed is a culture of -ownership-. That's basically people saying "I'm responsible". Not one where everyone tries to avoid responsibility, and not one where people point fingers.

  • Why does someone need to take responsibility when you can have a culture of blameless postmortems where everyone focuses on making sure whatever happened never happens again instead? In blameless postmortem culture, everyone is responsible by default.

    • "Everyone focuses" = nothing gets done. I've been at places like that, where a post-mortem happens, a course of action is decided on...and then no one owns actually carrying out that course of action.

      You could argue that "It should be assigned" - yeah, it should. But assigning it implies either "here is the team that is responsible for it", i.e., this is the team responsible and they need to be told to fix their shit (which very much sounds like blame), OR it implies "here is the team that I am entrusting to fix it DESPITE their obviously not being responsible for it", which is just as bad, since it implies that the team that 'is' responsible for it is incompetent.

      The only healthy option is that the 'responsible' team stands up to say "hey, that's ours; we'll fix it", and the only way they'll do that is if you have a culture of safety and ownership.

      Also, one thing to make clear - ownership = responsible = blame. They're all words for the same thing, just different implications. You can't have someone 'own' something without making them responsible, and apt to be blamed if you don't ensure the culture is one that does not attach blame. That's really what I was getting at; of course you shouldn't blame. But, you can't also avoid ownership. But ownership implies you know WHO to blame, and so blame comes very easily. And it's very easy to mistake pointing out responsibility/ownership for something as blame; I have had multiple managers tell me "it's not us vs them" when I've raised up the fact that I'm unable to deliver to deadlines because I have been unable to get anything from product.

    • The blameless postmortem is a "legal fiction" that doesn't really mean that blame cannot be assigned, just that blame cannot result in punishment or loss of face/standing.

      At the end of the day you are going to have someone stand up and say: yep, we should have planned for this, and we will correct this in x, y, z ways.

  • What does it mean to be responsible? Just to say it? Responsibility should be accompanied with fines corresponding to the damage or something like it. Otherwise those are just words. I'm responsible, but I'm not getting any fines if something goes wrong, so whatever, but I'm responsible. Fire me if you want, I'll find new work in a few days, but I was responsible.

    It's the business owner who's responsible, because ultimately he bears all the expenses when a critical event happens, a client leaves, a client sues the company, and so on. Other people are not really responsible, they just pretend to be.

    • So I've actually written about this in the past, but responsibility is -actually affecting the entity-.

      That is, "you're responsible for this" - if they do it, and it succeeds, what happens? If they don't do it, and it fails, what happens? If the answer is "nothing" in either of those cases, they're not actually responsible. If the result is too detached, they're also not actually responsible (i.e., if I decide not to do one of the ten tasks assigned to me, and I don't hear about it until review time, if at all, then I was never responsible).

      Responsibility is innately tied with knowledge and empowerment, but without going on at length, and to just give an example - if I'm the one woken up by the pagerduty alarm when something breaks, I am responsible for that something, because its success or failure directly affects me. If, however, there is a separate ops team that has to deal with it, and I can slumber peacefully, responsibility has been diluted; you won't get as good a result.

    • Just speaking for myself, but if my employer started issuing fines... let's just say I would start running...

So how did it play out? Please, do not leave me hanging here.

  • Honestly, I don't know; I unknowingly followed the author's advice. About half a year after the last incident, a friend who I went to school with called me up and offered me a job at his fledgling biotech. I accepted and never looked back.