Comment by daxfohl

8 years ago

Side question, as a dev with zero previous ops experience, now the solo techie for a small company and learning ops on the fly, we're obviously in the situation where "all devs have direct, easy access to prod", since I'm the only dev. What steps should I take before bringing on a junior dev?

* Local env setup docs should have no production creds in them (EDIT: production creds should always be encrypted at rest)

* new dev should only have full access to local and dev envs, no write access to prod

* you're backing up all of your databases, right? Nightly is mandatory, hourly is better

* if you don't have a DBA, use RDS
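The nightly-backup point can be as simple as a cron entry on the DB host. A sketch, assuming Postgres (the database name and output path are placeholders; swap in mysqldump or similar as appropriate):

```shell
# crontab -e on the DB host: nightly dump at 02:00 (hourly variant commented out).
# Note the escaped \% signs: % is special inside crontab lines.
0 2 * * * pg_dump --format=custom mydb > /var/backups/db/mydb-$(date +\%F).dump
# 0 * * * * pg_dump --format=custom mydb > /var/backups/db/mydb-hourly.dump
```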

That'll prevent the majority of weapons-grade mistakes.

Source: 15 years in ops

Do the best you can to "find compute room" (laptop, desktop, spare servers on the rack that aren't being used, the cloud) and make a Stage.

Make changes to Stage after doing a "Change Management" process (effectively: document every change you plan to make, in enough detail that an average person typing the steps out would succeed). Test these changes. It's nicer if you have a test suite, but you won't at first.

Once testing is done and looks good, make the changes on prod in accordance with the CM. Make sure everything has a back-out procedure, even if it is "drive to get backups and restore". But most of these should be "copy the config to /root/configs/$service/$date, then edit the live config". Backing out then just means restoring the backed-up config.
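The "copy the config, then edit" step can be a tiny helper script. A sketch; `BACKUP_ROOT` is my own addition (overridable, e.g. for testing), the text above just uses /root/configs:

```shell
#!/bin/sh
# Snapshot a live config before editing it, so backing out is just
# copying the file back. BACKUP_ROOT is an assumption, not from the post.
BACKUP_ROOT="${BACKUP_ROOT:-/root/configs}"

backup_config() {
    service="$1"
    conf="$2"
    dest="$BACKUP_ROOT/$service/$(date +%Y-%m-%d)"
    mkdir -p "$dest"
    cp -p "$conf" "$dest/"
    # print the backup path so it can go straight into the CM doc
    echo "$dest/$(basename "$conf")"
}
```

Backing out is then `cp` of the printed path back over the live config.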

________________________

Edit: As an addendum, many shops this small have insufficient, non-existent, or Schrödinger backups (untested, so you don't know whether they work until you try a restore). Having a healthy, living stage environment does 2 things:

1. You can stage changes so you don't get caught with your pants down on prod, and

2. It is a hot spare for prod in case prod catches fire.

In all likelihood "all" of prod won't DIAF, but maybe the machine that houses the DB has a PSU failure that fries the motherboard. You at least have a hot machine, even if it's running stale data from yesterday's imported snapshot.

  • You missed one of the really nice points of having a stage there. You use it to test your backups by restoring from live every night/week. By doing that, you discourage developing on staging and you know for sure you have working backups!

    • Indeed. But if it's just one guy doing dev, I was trying to go for something rigorous yet still maintainable.

      Ideally, you want test -> stage -> prod, with Puppet and Ansible running on a VM fabric. Changes are made to the stage and prod areas of Puppet, with configuration management backed by Git or SVN or whatever for version control. Puppet manifests can be written and submitted to version control, with a guarantee that if you can code it, you know what you're doing. Ansible is there to run one-off commands, like reloading Puppet (the default is a check-in every 30 minutes).

      And to be even safer, you have hot backups in prod. Anything that runs in a critical environment can have hot backups or sit behind HAProxy. For small instances even things like DRBD can be a great help, and MySQL, Postgres, Mongo and friends all support master/slave replication or sharding.

      Generally, get more machines running the production dataset and tools, so if shit goes down you have: machines ready to handle load, backup machines able to handle some load, full data backups onsite, and full data backups offsite. The real ideal is that the data is available on 2 or 3 cloud compute platforms, so when the absolute worst case occurs you can spin up VMs on AWS or GCE or Azure.

      Our solution for Mongo is ridiculous, but the best for backing up. The Mongo backup util doesn't guarantee any sort of consistency, so either you lock the whole DB (!) or the DB changes underneath you while you back it up... So we take LVM snapshots at the filesystem layer and back those up. It's ridiculous that Mongo doesn't have this kind of transactional backup apparatus. But we needed time-series data storage, and MongoDB was pretty much it.
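The LVM-snapshot approach might look something like this (a sketch only: it assumes root, Mongo's data on an LVM volume, and journaling enabled for crash-consistency; the volume group, mount point, and sizes are all placeholders):

```shell
#!/bin/sh
# 1. Freeze a point-in-time view of the filesystem under Mongo.
#    (vg0/mongo and the 10G copy-on-write reserve are hypothetical.)
lvcreate --snapshot --size 10G --name mongo-snap /dev/vg0/mongo

# 2. Mount the snapshot read-only and archive it.
mount -o ro /dev/vg0/mongo-snap /mnt/mongo-snap
tar -czf "/var/backups/mongo-$(date +%F).tar.gz" -C /mnt/mongo-snap .

# 3. Tear the snapshot down so it stops consuming copy-on-write space.
umount /mnt/mongo-snap
lvremove -f /dev/vg0/mongo-snap
```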

As I said in another post, the least you can do is modify your hosts file so you can't access the production database from your local computer. Then you have to log in to a remote machine to access production.
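The hosts-file trick is a one-line config change (the hostname below is a placeholder for whatever your prod DB resolves to):

```shell
# /etc/hosts on your local machine: point the prod DB's name at a
# black hole so local tools can't connect to it by accident.
0.0.0.0    prod-db.internal.example.com
```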

As advised elsewhere, before you have a DBA you should consider buying a hosted service like RDS, which provides at a minimum backups and restore points. You can even have separate dev and prod accounts on RDS.

  • before you have a DBA

    You never don't have a DBA. If you don't know who it is, it's you! But there will always, always be someone who is held responsible for the security, integrity and availability of the company's data.

Best rule of thumb whenever you're doing work as a solo dev/ops guy: always think in terms of being two people, the normal you (with super-user privs etc.) and the "junior dev/ops" you who just started his first day. Whatever you're working on needs to support both variants of you (with appropriate safeguards, checks and balances in place for junior you).

E.g. when deciding how to back up your prod database, if you're thinking as both "personas" you'll come up with a strategy that safely backs up the database but also makes it easy for a non-privileged user to securely pull down an (optionally sanitised) version of the latest snapshot for seeding their dev environment [and then dogfood this by using that process to seed your own dev environment].

Some other quick & easy things:

- Design your terraform/ansible/whatever scripts such that anything touching sensitive parts needs out-of-band SSH keys or credentials. E.g. if you have a terraform script that brings up your prod environment on AWS, make sure that the AWS credentials file it needs isn't auto-provisioned alongside the script. Instead write down on a wiki somewhere that the team member (at the moment, you) who has authority to run that terraform script needs to manually drop his AWS credentials file in directory /x/y/z before running the script. Same goes for ansible: control and limit which SSH keys can log in to different machines (don't use a single "devops" key that everyone shares and imports into their keychains!). Think about what you'll need to do if a particular key gets compromised or a person leaves the team.
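The "manually dropped credentials" rule can be enforced with a small guard in the wrapper that runs terraform. A sketch (the /x/y/z path follows the text above; the wrapper itself is hypothetical):

```shell
#!/bin/sh
# Refuse to run the prod terraform unless the operator has manually
# placed credentials at the documented, un-provisioned path.
require_creds() {
    if [ ! -f "$1" ]; then
        echo "credentials not found at $1, see the ops wiki" >&2
        return 1
    fi
}

# Usage (commented out; AWS_SHARED_CREDENTIALS_FILE is honoured by the
# AWS SDK that terraform's AWS provider uses):
# require_creds /x/y/z/aws-credentials &&
#     AWS_SHARED_CREDENTIALS_FILE=/x/y/z/aws-credentials terraform apply
```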

- Make sure your backups are working, taken regularly, stored in multiple places and encrypted before they leave the box being backed up. Borgbackup and rsync.net are a nice, easy solution for this.
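A Borg + rsync.net setup is only a few commands; a sketch with placeholder repo URL and paths (Borg encrypts client-side, so data leaves the box already encrypted):

```shell
# One-time: create an encrypted repo on the remote host.
borg init --encryption=repokey ssh://user@user.rsync.net/./backups

# Nightly (e.g. from cron): archive, then prune old archives.
borg create --stats ssh://user@user.rsync.net/./backups::'{hostname}-{now}' /etc /var/lib/app
borg prune --keep-daily=7 --keep-weekly=4 --keep-monthly=6 ssh://user@user.rsync.net/./backups
```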

- Make sure you test your backups!

- Don't check passwords/credentials into source control without first encrypting them.

- Use sane, safe defaults in all scripts. Like another poster mentioned, don't write "if env != test, then do prod stuff": allowlist the dangerous case explicitly instead of letting everything else fall through to it.
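The safe-default pattern, sketched as a dispatch function (the function and action names are mine, for illustration): an unset or misspelled environment must never reach the prod branch.

```shell
#!/bin/sh
# Allowlist prod instead of writing `if env != "test"`.
run_for_env() {
    case "${1:-dev}" in
        prod)     echo "prod-actions" ;;   # only an exact match reaches prod
        dev|test) echo "safe-actions" ;;   # unset input defaults to dev
        *)        echo "unknown env '$1', refusing to run" >&2; return 1 ;;
    esac
}
```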

- RTFM and spend the extra 20 minutes to set things up correctly and securely rather than walking away the second you've got something "working" (thinking 'I'll come back later to tidy this up' - you never will).

- Follow the usual security guidelines: firewall machines (internally and externally), limit privileges, keep packages up to date, layer your defences, use a bastion machine for access to your hosted infrastructure

- Get in the habit of documenting shit so you can quickly put together a straightforward on-boarding/ops wiki in a few days if you suddenly do hire a junior dev (or just to help yourself when you're scratching your head 6 months later wondering why you did something a certain way)