
Comment by darkwater

19 hours ago

You have to remove admin rights from your admins then, because scrappy enough DevOps/platform engineers/whatever will totally hand-edit your AWS infra or Kubernetes deployments. I suffered that first-hand. And it's even worse than in the old days, because at least back then it was expected.

Or at least you have to automatically destroy and recreate all nodes / VMs / similar every N days, so that nobody can count on any truly unavoidable hand-edits from emergency situations persisting. Possibly also gate the ability to do hand-edits behind a break-glass feature that notifies executives or schedules a postmortem meeting about why it was necessary.
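
For the recycle-every-N-days part, here's a minimal sketch assuming an EC2 Auto Scaling group (the group name and the seven-day cap are made up); AWS then terminates and replaces each instance once it hits the lifetime cap:

    import boto3

    # Cap instance lifetime so AWS terminates and replaces every node
    # automatically. Assumes configured credentials; the group name
    # "web-workers" is hypothetical.
    N_DAYS = 7

    boto3.client("autoscaling").update_auto_scaling_group(
        AutoScalingGroupName="web-workers",         # hypothetical ASG
        MaxInstanceLifetime=N_DAYS * 24 * 60 * 60,  # seconds; minimum is 86400
    )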

  • I know of at least one organisation that would automatically wipe every instance on (SSH) user logout, so you could log in to debug, but nothing you did would persist at all. I quite like that idea, though sometimes being able to, e.g., delay the wipe for up to X hours might be slightly easier to deal with for genuinely critical emergency fixes (a toy sketch of the idea follows after this list).

    But, yes, gating it behind notifications would also be great.

  • Oh no, it ran out of disk space because of a bug! I will run a command on that instance to free space rather than fix the bug. Oh no, the error now happens half of the time; better debug for hours only to find out someone fixed only a single instance…

    I will never understand the argument for cloud other than bragging rights about burning money, and then saving money that never should have been burned to begin with.
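
Here's that wipe-on-logout idea as a toy Python sketch, not anyone's actual setup: it assumes an EC2 instance, a pam_exec session hook, and a hypothetical /run/wipe-delay marker file for the emergency-delay variant:

    import os
    import time
    import urllib.request

    import boto3

    DELAY_MARKER = "/run/wipe-delay"  # hypothetical: write a future epoch here to delay
    MAX_DELAY = 4 * 60 * 60           # cap emergency delays at four hours

    def instance_id() -> str:
        # EC2 instance metadata service (IMDSv1 form, for brevity)
        url = "http://169.254.169.254/latest/meta-data/instance-id"
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.read().decode()

    def deadline() -> float:
        # No marker file means wipe right away; a marker can push the
        # wipe out, but never past MAX_DELAY from now.
        try:
            with open(DELAY_MARKER) as f:
                requested = float(f.read())
        except (OSError, ValueError):
            return time.time()
        return min(requested, time.time() + MAX_DELAY)

    # pam_exec sets PAM_TYPE; only act when an SSH session closes.
    # A real version would also check that no other sessions remain.
    if os.environ.get("PAM_TYPE") == "close_session":
        while time.time() < deadline():
            time.sleep(30)  # re-read the marker so the delay can be extended
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id()])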

Nah, just run Puppet or similar. You’re welcome to run your command to validate what you already tested in stage, but if you don’t also push a PR that changes the IaC, it’s getting wiped out in a few minutes.
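
Not Puppet itself, but the shape of that mechanism as a toy Python loop: re-assert desired state from the repo every few minutes, so a hand-edit that never lands in the IaC gets overwritten (the paths and interval are made up):

    import shutil
    import time

    DESIRED = "/srv/iac/nginx.conf"  # hypothetical checkout from the IaC repo
    LIVE = "/etc/nginx/nginx.conf"   # what someone may have hand-edited
    INTERVAL = 300                   # converge every five minutes

    def converge() -> None:
        # Compare live state to desired state; on drift, desired wins.
        with open(DESIRED, "rb") as want, open(LIVE, "rb") as have:
            if want.read() == have.read():
                return
        shutil.copyfile(DESIRED, LIVE)

    while True:
        converge()
        time.sleep(INTERVAL)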

I hate not having root access. I don’t want to have to request permission from someone who has no idea how to do what I want to do. Log everything, make everything auditable, and hold everyone accountable - if I fuck up prod, my name will be in logs, and there will be a retro, which I will lead - but don’t make me jump through hoops to do what I want, because odds are I’ll instead find a way around them, because you didn’t know what you were doing when you set up your security system.

But then your next deployment goes, and it all rolls back, right?

And then it's their fault, right?

I might have mild trauma from people complaining their artisanal changes to our environment weren’t preserved.

  • Weeeeell, if you use Helm, the manual change might be preserved, which makes investigations even more... interesting. (Helm 3's three-way merge patches only what the chart changed, so out-of-band edits it doesn't conflict with can survive an upgrade.)

In my org nobody has admin rights except in emergencies, but we are ending up with a directory full of GitHub workflows and nobody knows which of them are currently supposed to work.

Nothing beats people knowing what they are doing and cleaning up after themselves.