Comment by joshuamorton

5 years ago

There are lots of places where we require that no single person can break the system at least in a certain way.

For example, code review and LGTM ensure that a single individual can't just break the system by pushing bad code.

Often there are other control planes that don't have the same requirement, but I think the idea that there must always be one person who can break the system isn't clearly true.

> For example, code review and LGTM ensure that a single individual can't just break the system by pushing bad code.

Assuming code reviews are 100% effective at catching issues, particularly issues that cross multiple projects, yes.

  • I'm making an (admittedly subtle) distinction here between complex mistakes, where something was missed, and simple mistakes/bad actors where someone used a privilege in a manner they shouldn't have.

    LGTM ensures that, for example, a single individual can't push a code change that drops the database. On the other hand, that same individual might be able to turn off the database in the AWS console.

    • > LGTM ensures that, for example, a single individual can't push a code change that drops the database.

      Personally, I've seen LGTM let slip complex bugs in accounting code (admittedly, not great code) that went on to irreversibly corrupt hundreds of millions of accounting records.

      Yes, it will catch "DROP DATABASE", but when it's still letting through major bugs that similarly require a full restore from backup... It seems functionally equivalent?

      Given:

      > There are lots of places where we require that no single person can break the system at least in a certain way.

      I don't think code reviews are a solution. I mean, they're one of the better solutions I can think of, but they're not actually a solution.

> For example, code review and LGTM ensure that a single individual can't just break the system by pushing bad code.

There's always someone with rights to push code manually, or to tell the system to push an older version of the code that won't work anymore. Someone needs to install and administer the system that pushes the code, and even if they don't have direct access to push the code to where it eventually goes, someone's credentials (or someone who controls the system that holds those credentials) have access somewhere along the way.

But who ensures that the code system is up and available to even allow check-ins? Can one person break that? What about whoever controls the power state of the systems the code gets checked in on? Is that also ensured not to be a single person? What about office access? What about the power main at your building? Is it really impossible for one person to cause problems there?

It might sound like I'm moving the goalposts, but that's sort of my point: these are all dependencies on each other. It's impossible to actually make it so one person can't cause any problems, because you can't eliminate all dependencies, and you can't even accurately know what they all are. What you can do is focus on the likely ones and put in place whatever safeguards are sane, but forgo the crazy effort of chasing the diminishing returns of making failure impossible, and instead spend that time and effort on making recovery quick and easy.

Unfortunately, some of the work that goes into making sure no one person can cause a problem might actually make recovery harder. Requiring someone to sign off on a commit before it goes live is great at 2 PM Tuesday, but not so great when it's required to fix something at 2 AM Sunday. This is the tightrope that needs to be walked. And even if you don't necessarily know about it, there probably is someone who has access to break something all by themselves, because they're the one who gets called in to make sure it can be fixed when the shit hits the fan and all those roadblocks meant to prevent problems need to be bypassed so the current problem can actually be fixed.

Any system that doesn't have some people like that at various levels persists in that state only until there's a problem and, in the incident assessment, someone needs to answer why a 5-minute fix took hours, and the answer includes a lot of "we needed umpteen different people and only a fraction were available immediately".
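
To make that trade-off concrete, here's a rough sketch in Python of a deploy gate with a review requirement plus an emergency bypass. Everything in it (`Change`, `DeployGate`, `break_glass_users`, the user names) is hypothetical and made up for illustration, not how any particular company's tooling works.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    author: str
    approvers: set[str] = field(default_factory=set)


@dataclass
class DeployGate:
    # Hypothetical set of people allowed to bypass review in an emergency.
    break_glass_users: set[str]
    audit_log: list[str] = field(default_factory=list)

    def can_deploy(self, change: Change, pusher: str, emergency: bool = False) -> bool:
        # Normal path: at least one approver who is not the author.
        if change.approvers - {change.author}:
            return True
        # Emergency path: one trusted person may skip review,
        # but the bypass is recorded for the incident review.
        if emergency and pusher in self.break_glass_users:
            self.audit_log.append(f"break-glass deploy by {pusher}")
            return True
        return False


gate = DeployGate(break_glass_users={"oncall_sre"})
reviewed = Change(author="alice", approvers={"bob"})
unreviewed = Change(author="alice")
self_approved = Change(author="alice", approvers={"alice"})
```

The point is that the emergency branch exists at all: whoever is in `break_glass_users` is, by design, a single person who can push unreviewed code. The audit log only tells you about it afterwards.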

Even at Google (where I see from your profile that you work), my guess is that people in the SRE stack can cause a very bad day for most apps. My guess is that even if the party line is that no one person can screw anything up, you probably don't have to ask too many SREs there before someone notes that it's more of an aspiration than a reality.

Sorry if that's a bit rambly. I know you weren't specifically countering what I was saying. I've just had a lot of years of sysadmin experience, and it's pretty easy to see the gaps in a lot of these solutions, even where the face presented looks pretty secure.