Comment by macNchz

19 hours ago

Having worked on various teams operating infrastructure that ranged from a rack in the back of the office to a few beefy servers in a colo, a fleet of Chef-managed VMs, GKE, ECS, and various PaaSes, what I've liked most about cloud and containerized workflows is that they wind up being a forcing function for reproducibility, at least to a degree.

While it's absolutely 100% possible to have a "big beefy server architecture" that's reasonably portable, reproducible, and documented, it takes discipline and policy to avoid the "there's a small issue preventing {something important}, I can fix it over SSH with this one-liner and totally document it/add it to the config management tooling later once we've finished with {something else important}" pattern, and once people have been doing that for a while it's a total nightmare to unwind down the line.

Sometimes I want to smash my face into my monitor the 37th time I push an update to some CI code and wait 5 minutes for it to error out, wishing I could just make that band-aid fix, but at the end of the day I can't forget to write down what I did, since it's in my Dockerfile or deploy.yaml or entrypoint.sh or Terraform or whatever.

You have to remove admin rights from your admins then, because sufficiently scrappy DevOps/platform engineers/whatever will totally hand-edit your AWS infra or Kubernetes deployments. I suffered that firsthand. And it's even worse than in the old days, because back then it was at least expected.

  • Or at least you have to automatically destroy and recreate all nodes/VMs/similar every N days, so that nobody can pretend that any truly unavoidable hand-edits made during emergencies will persist. Possibly also gate the ability to make hand edits behind a break-glass feature that notifies executives or schedules a postmortem meeting about why it was necessary.

    • I know of at least one organisation that'd automatically wipe every instance on (ssh-)user logout, so you could log in to debug, but nothing you did would persist at all. I quite like that idea, though sometimes being able to e.g. delay the wipe for up to X hours might be slightly easier to deal with for genuinely critical emergency fixes.

      But, yes, gating it behind notifications would also be great.
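One way the wipe-on-logout idea could be wired up is with PAM's standard `pam_exec` module, which runs a hook on session open/close and exports `PAM_TYPE` to it. This is a sketch under assumptions: the script path and the `instance-wipe.timer` unit are hypothetical, and details vary by distro.

```
# /etc/pam.d/sshd — run a hook whenever an SSH session opens or closes:
session optional pam_exec.so /usr/local/sbin/wipe-on-logout.sh

# /usr/local/sbin/wipe-on-logout.sh — pam_exec sets PAM_TYPE for us:
#!/bin/sh
if [ "$PAM_TYPE" = "close_session" ]; then
    # Hypothetical: hand off to a unit that reimages/replaces the instance,
    # optionally after a delay, to allow the "up to X hours" grace period.
    systemctl start instance-wipe.timer
fi
```

Doing the wipe via a timer rather than inline keeps the logout itself fast and gives a natural place to implement the delayed-wipe variant.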


    • Oh no, it ran out of disk space because of a bug! I'll run a command on that instance to free space rather than fix the bug. Oh no, the error now happens half the time; better debug for hours, only to find out someone fixed just a single instance…

      I will never understand the argument for the cloud other than bragging rights about burning money, then saving money that never should have been burned to begin with.
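The "destroy and recreate every N days" policy from further up the thread can be sketched as a small selection function. This is a minimal illustration, not anyone's actual tooling: the instance records and the 14-day default are hypothetical, and in practice the IDs would come from your cloud provider's API and be fed to a terminate call, letting the autoscaler replace the nodes (and wipe any hand edits).

```python
from datetime import datetime, timedelta, timezone

def instances_to_recycle(instances, max_age_days=14, now=None):
    """Pick instances older than max_age_days for destruction/recreation.

    `instances` is a list of dicts with an 'id' and a timezone-aware
    'launch_time'. The orchestrator would terminate the returned IDs
    and let the autoscaler bring up fresh, config-managed replacements.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [i["id"] for i in instances if i["launch_time"] < cutoff]
```

Run from a scheduler (cron, a CI job, a Lambda), this guarantees an upper bound on how long any SSH one-liner can survive.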

  • Nah, just run Puppet or similar. You’re welcome to run your command to validate what you already tested in stage, but if you don’t also push a PR that changes the IaC, it’s getting wiped out in a few minutes.

    I hate not having root access. I don’t want to have to request permission from someone who has no idea how to do what I want to do. Log everything, make everything auditable, and hold everyone accountable - if I fuck up prod, my name will be in logs, and there will be a retro, which I will lead - but don’t make me jump through hoops to do what I want, because odds are I’ll instead find a way around them, because you didn’t know what you were doing when you set up your security system.
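The Puppet approach above can be sketched with a minimal manifest (module, file paths, and service name are hypothetical). Puppet re-asserts the declared state on every agent run, by default every 30 minutes, so a hand edit over SSH survives only until the next run unless a matching IaC change lands:

```puppet
# Hypothetical manifest: any drift from this declared state is reverted
# on the next agent run, wiping untracked hand edits.
file { '/etc/myapp/config.yml':
  ensure => file,
  source => 'puppet:///modules/myapp/config.yml',
  owner  => 'root',
  mode   => '0644',
  notify => Service['myapp'],
}

service { 'myapp':
  ensure => running,
  enable => true,
}
```

This gives exactly the workflow described: you're free to test a fix live, but the change only persists if you also ship it through the repo.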

  • But then your next deployment goes out, and it all rolls back, right?

    And then it's their fault, right?

    I might have mild trauma from people complaining their artisanal changes to our environment weren’t preserved.

  • In my org nobody has admin rights except in emergencies, but we're ending up with a directory full of GitHub workflows, and nobody knows which of them are currently supposed to work.

    Nothing beats people knowing what they're doing and cleaning up after themselves.

I'm still a pretty big fan of Docker (compose) behind Caddy as a reverse proxy... I think containers do offer a lot in terms of application support, even if they're a slightly bigger hoop to jump through when getting started in some ways.

  • I'm working on an app server that auto-deploys itself behind Caddy, with DNS/SSL auto-config. Caddy is amazing, and there really should be no reason for complex setups for most people these days... I've worked on some huge systems, but most systems can run on trivially simple setups given modern hardware.
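As a sketch of the kind of "compose behind Caddy" setup described above (the image and service names are hypothetical, not from the thread), a minimal docker-compose.yml might look like:

```yaml
# Hypothetical compose file: Caddy terminates TLS and proxies to the app.
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data        # persists ACME certificates across restarts
  app:
    image: myapp:latest          # hypothetical application image

volumes:
  caddy_data:
```

The Caddyfile itself can then be as short as `example.com { reverse_proxy app:8080 }`; Caddy handles certificate issuance and renewal automatically, which is a big part of why such setups stay trivially simple.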