Accidentally destroyed production database on first day of a job

8 years ago (np.reddit.com)

Sorry, but if a junior dev can blow away your prod database by running a script on his _local_ dev environment while following your documentation, you have no one to blame but yourself. Why is your prod database even reachable from his local env? What does the rest of your security look like? Swiss cheese I bet.

The CTO further demonstrates his ineptitude by firing the junior dev. Apparently he never heard the famous IBM story, and will surely live to repeat his mistakes:

After an employee made a mistake that cost the company $10 million, he walked into the office of Tom Watson, the C.E.O., expecting to get fired. “Fire you?” Mr. Watson asked. “I just spent $10 million educating you.”

  • Seriously. The CTO in question is the incompetent one. S/he failed:

    - Access control 101. Seriously, this is pure incompetence. It is the equivalent of having the power cord to the Big Important Money Making Machine snaking across the office and under desks. If you can't be arsed to ensure that even basic measures are taken to avoid accidents, acting surprised when they happen is even more stupid.

    - Sensible onboarding documentation. Why would prod access information be stuck in the "read this first" doc?

    - Management 101. You hired a green dev fresh out of college who has no idea how things are supposed to work, then fired him in an incredibly nasty way for making an entirely predictable mistake that came about because of your own lack of diligence (see above).

    Also, I have no idea what your culture looks like, but you just told all your reports that honest mistakes can be fatal and their manager's judgement resembles that of a petulant 14 year-old.

    - Corporate Communications 101. Hindsight and all that, but it seems inevitable that this would lead to a social media trash fire. Congrats on embarrassing yourself and your company in an impressive way. On the bright side, this will last for about 15 minutes and then maybe three people will remember. Hopefully the folks at your next gig won't be among them.

    My takeaway is that anyone involved in this might want to start polishing their resumes. The poor kid and the CTO for obvious reasons, and the rest of the devs, because good lord, that company sounds doomed.

    • Yeah, when I read that, my first thought was that the CTO reacted that way because he was afraid of being fired himself. I wouldn't be at all surprised if he wrote that document or approved it himself.

      13 replies →

  • Here are some simple, practical tips you can use to prevent this and other Oh Shit Moments(tm):

    - Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly. Saving a few bucks here is incredibly shortsighted; your database is probably the most valuable asset you have. RDS allows point-in-time restore of your DB instance to any second during your retention period, up to the last five minutes. That will make you sleep better at night.

    - Separate your prod and dev AWS accounts entirely. It doesn't cost you anything (in fact, you get 2x the AWS free tier benefit, score!), and it's also a big help in monitoring your cloud spend later on. Everyone, including the junior dev, should have full access to the dev environment. Fewer people should have prod access (everything devs may need for day-to-day work like logs should be streamed to some other accessible system, like Splunk or Loggly). Assuming a prod context should always require an additional step for those with access, and the separate AWS account provides that bit of friction.

    - The prod RDS security group should only allow traffic from whitelisted security groups that are also in the prod environment. For those really requiring a connection to the prod DB, it is therefore always a two-step process: local -> prod host -> prod db. But carefully consider why you are even doing this in the first place. If you find yourself doing it often, perhaps you need more internal tooling (like an admin interface, again behind a whitelisting SG).

    - Use a discovery service for the prod resources. One of the simplest methods is just to set up a Route 53 Private Hosted Zone in the prod account, which takes about a minute. Create an alias entry like "db.prod.private" pointing to the RDS instance and use that in all configurations. Except for the Route 53 record, the actual address for your DB should not appear anywhere. Even if everything else goes sideways and you've assumed a prod context locally by mistake, a tool pointed at the prod config will fail because the address doesn't resolve in a local context.
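
    A minimal boto3 sketch of that record, for the curious (the zone ID and RDS endpoint below are placeholders, and a CNAME is used since plain RDS endpoints can't be Route 53 alias targets):

        # Sketch: create db.prod.private in the prod account's Private Hosted Zone (placeholders throughout).
        import boto3

        route53 = boto3.client("route53")  # assumes credentials for the prod account
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000000EXAMPLE",  # placeholder: the private zone attached to the prod VPC
            ChangeBatch={
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "db.prod.private",
                        "Type": "CNAME",
                        "TTL": 300,
                        # placeholder endpoint; the Route 53 record is the only place the real address lives
                        "ResourceRecords": [{"Value": "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com"}],
                    },
                }]
            },
        )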

    • You made a lot of insightful points here, but I'd like to chime in on one important point:

      > - Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've set up the backups correctly.

      The real way to not worry about whether you've set up backups correctly is to set up the backups, then actually exercise, and document, the recovery procedure. Over the last 30 years I've seen countless nasty surprises when people try to restore their backups during an emergency. Hopefully checking the "yes, back this up" checkbox on RDS covers you, but actually following the recovery procedure and checking the results is the only way to not have some lingering worry.

      In this particular example, there might be lingering surprises: part of the data might live in other databases or in storage like S3 whose backups aren't in sync with the primary backup, or there may be caches and queues that need to be reset as part of the recovery procedure.
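
      For the RDS case specifically, the drill itself can be scripted; a rough sketch under those assumptions (instance identifiers are placeholders, and the real value is in the verification queries you run against the restored copy):

          # Sketch: periodic restore drill against RDS (all identifiers are placeholders).
          import boto3

          rds = boto3.client("rds")

          # Restore the latest restorable point into a throwaway instance...
          rds.restore_db_instance_to_point_in_time(
              SourceDBInstanceIdentifier="prod-db",
              TargetDBInstanceIdentifier="restore-drill",
              UseLatestRestorableTime=True,
          )
          rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="restore-drill")

          # ...then the part people skip: connect to the restored copy, run sanity queries
          # (row counts, freshest timestamps, the S3/cache cross-checks mentioned above),
          # and only delete the throwaway instance once someone has looked at the results.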

      2 replies →

    • And put a firewall between your dev machines and your production database. All production database tasks need to be done by someone who has permission to cross into the production side -- a dev machine shouldn't be allowed to talk to it.

      1 reply →

  • I agree, it's the fault of the CTO. To me, the CTO sounds pretty incompetent. The junior engineer did them a favor. This company seems like an amateur hour operation, since data was deleted so easily by a junior engineer.

    • Yup, I've heard stories of junior engineers causing millions of dollars worth of outages. In those cases the process was drilled into, the control that caused it was fixed, and the engineer was not given a reprimand.

      If you have an engineer who goes through that and shows real remorse, you're going to have someone who's never going to make that mistake (or similar ones) again.

      21 replies →

    • Yep. I had a junior working for me once a few years ago that made a rather unfortunate error in production which deleted all of several customers' data. I could tell he was on pins and needles when he brought it to me, so I let him off the hook right away and showed him the procedures to fix the issue. He said something about being thankful there was a way to fix the problem, and I just smiled and told him A) it would have been my fault if there hadn't been; and B) he wouldn't have had the access he did without safeguards in place. Then I told him a story about the time I managed to accidentally delete an entire database of quarantined email from a spam appliance I was working on several years earlier. Sadly, my CTO at the time did NOT prepare for that.

      I lost a whole weekend of sleep in recovering that one from logs, and that was when I learned some good tricks for ensuring recoverability....

    • Agreed. Also, why didn't they have a backup of some sort? The hard drive on the server could have failed and it would have been just as bad.

      Sounds like an incompetent set of people running the production server.

      10 replies →

  • "It's your first day, we don't understand security so here's the combination to the safe. Have fun!!"

  • If someone on their first day of work can do this much damage, what could a disgruntled veteran do? If Snowden has taught us anything, it's that internal threats are just as dangerous as external threats.

    This shop sounds like a raging tire fire of negligence.

  • He didn't follow the docs exactly. That doesn't matter, though: your first day should be bulletproof, and if it's not, it's on the CTO. The buck does not stop with junior engineers on their first day.

  • Thanks for the Tom Watson quote, I'd never heard it before; it's a good one. Also agree with everything else you said, this is not the junior dev's fault at all.

  • He might be inept, but in this instance the CTO is mainly just covering his own ass.

    • "Yeah the whole site is buggered, and the backups aren't working - but I fired the Junior developer who did it" Is not how you Cover Your Ass ™.

      2 replies →

I was on a production DB once, ran SHOW FULL PROCESSLIST, and saw "delete from events" had been running for 4 seconds. I killed the query and set up that processlist command to run every 2 seconds. Sure enough, the delete kept reappearing shortly after I killed it. I wasn't on a laptop, but I knew the culprit was somewhere on my floor of the building, so I grabbed our HR woman who was walking by, told her to watch the query window, and showed her how to kill the process if she saw the delete again. Then I ran out and searched office to office until I found the culprit -

Our CTO thought he was on his local dev box, and was frustrated that "something" was keeping him from clearing out his testing DB.

Did I get a medal for that? No. Nobody wanted to talk about it ever again.
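
For the curious, that kind of watchdog is only a few lines. A rough sketch, assuming MySQL and pymysql, with placeholder connection details (not the actual script I used):

    # Sketch: kill any "delete from events" seen in the processlist, every 2 seconds.
    import time
    import pymysql

    conn = pymysql.connect(host="prod-db.example.com", user="admin", password="...", autocommit=True)

    while True:
        with conn.cursor() as cur:
            cur.execute("SHOW FULL PROCESSLIST")
            for pid, user, host, db, command, secs, state, info in cur.fetchall():
                if info and info.strip().lower().startswith("delete from events"):
                    cur.execute("KILL %s", (pid,))
                    print(f"killed {pid} from {host}: {info}")
        time.sleep(2)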

  • Actually, the CTO should have mailed the dev team saying:

        Hi,

        Yesterday I thought I was on my local machine and cleared the database, while I was in fact on the production server.
        Luckily knodi123 caught it and killed the delete process. This is a reminder that *anybody* can make mistakes,
        so I want to set up some process to make sure this can't happen again. Meanwhile, I would like to thank knodi123.

        Best,

        CTO

  • Sometimes I get reminded about how awesome some of the tech we use is, in this case, transactions :)

  • Oh god, this is the worst... People make errors, you help them and they don't give you any credit. Hope you are working somewhere else now.

The comment I left there:

Lots of folks here are saying they should have fired the CTO or the DBA or the person who wrote the doc instead of the new dev. Let me offer a counterpoint. Not that it will happen here ;)

They should have run a post mortem. The idea behind it should be to understand the processes that led to a situation where this incident could happen. Gather stories, understand how things came to be.

With this information, folks can then address the issues. Maybe it shows that there is a serially incompetent individual who needs to be let go. Or maybe it shows a house of cards with each card placement making sense at the time and it is time for new, better processes and an audit of other systems.

The point is that this is a massive learning opportunity for all those involved. The dev should not have been fired. The CTO should not have lost his shit. The DB should have regularly tested backups. Permissions and access need to be updated. Docs should be updated to not contain sensitive information. The dev does need to contact the company to arrange surrender of the laptop. The dev should document everything just in case. The dev should have a beer with friends, relax for the weekend, and get back on the job hunt next week. Later, laugh and tell of the time you destroyed prod on your first day (and what you learned from it).

  • The firing order, ranked by theoretical impact on preventing future problems:

    1. The CTO. As the one in charge of the tech, he allowed the loss of critical data. If anyone should be fired, it's the CTO, and firing this guy apparently would have the greatest positive impact on the company, assuming they can hire a better one. Given how stupid this CTO is, that should be straightforward.

    2. The executives who hired the CTO. People hire people similar to themselves, and it seems the executive team is clueless about what kind of skills a CTO should have. These people will continue to fail the dev team by hiring incompetent people, or force them to work in a way that causes problems.

    3. The senior devs on the team. Obviously these people did not test what they wrote. If anyone had ever done a dry run of the onboarding doc, they would have prevented the catastrophe. It's a must-do in today's dev environment. The normal standard is to write automated tests for everything, though.

    This junior dev is the only one who should not be fired...

    • I'm amazed at how quickly everyone is trying to allocate blame, as if there must be someone on whom to heap it all. Commenters on both Reddit and HN are being high and mighty, offering wisdom that they would never have allowed this to take place, while eager to point fingers. I bet far more than half of these commenters have at one time or another worked for at least one company that had this kind of setup, and didn't immediately refuse to work on other tasks until the setup was patched. Hypocrites.

      The fact is, this kind of scenario is extremely common. Most companies I have worked for have the production database accessible from the office. It's a very obvious "no no", but it's typical to see this at small to medium sized companies. The focus is on rushing every new feature under the sun, and infrastructure is only looked at if something blows up.

      Nobody should have been fired. Not the developer, not the senior devs, not the sysadmins, and not the CTO. This should have been nothing more than a wake-up call to improve the infrastructure. That's it. The only blame here lies with the CTO - not for the event having taken place, but only because their immediate reaction was to throw the developer under the bus. A decent CTO would have simply said "oh shit guys, this can't happen again. please fix it". If the other executives can't understand that sometimes shit happens, and that a hammer doesn't need to be dropped on anyone, then they're not qualified to be running a business.

      1 reply →

  • You are right that this is an opportunity to learn, because it is a demonstration of incompetence at many levels. However, this incompetence has consequences that might be fatal for the company. How much time and effort will be required to level up? As a CEO I would request an independent audit of this incident ASAP to see the real extent of the problem.

    As my mother said, if you put a good apple with bad apples, it's not the bad apples that become good.

  • They are in no condition, yet, to run a post mortem. At this moment they're probably still trying to figure out how to get their data back or maybe just close up shop entirely.

    You run a post mortem when you're back and running again. They may never be back and running again.

>"i instead for whatever reason used the values the document had."

>They put full access plaintext credentials for their production database in their tutorial documentation

WHAT THE HELL. Wow. I'd be shocked at that sort of thing being written out in a non-secure setting, like, anywhere, at all, never mind in freaking documentation. Making sure examples in documentation are never real and will hard fail if anyone tries to use them directly is not some new idea; heck, there's an entire IETF RFC (#2606) devoted to reserving TLDs specifically for testing and example usage. Just mind blowing, and yeah there are plenty of WTFs there that have already been commented on in terms of backups, general authentication, etc. But even above all that, if those credentials had full access then "merely" having their entire db deleted might even have been the good-case scenario vs having the entire thing stolen, which seems quite likely if their auth is nothing more than a name/pass and they're letting credentials float around like that.

It's a real bummer this guy had such an utterly awful first day on a first job, particularly since, from the sound of it, he made a huge move and sank quite a bit of personal capital into taking that job. At the same time, that sounds like a pretty shocking place to work, and it might have taught a ton of bad habits. I don't think it's salvageable, but I'm not even sure he should try; they likely had every right to fire him, but threatening him at all with "legal" for that is very unprofessional and dickish. I hope he'll be able to bounce back and actually end up in a much better position a decade down the line, having some unusually strong caution and extra care baked into him at a very junior level.

  • There's also a high chance that document was shared on Slack. In which case, they were one Slack breach away from the entire world having write access to their prod database.

    It's depressing how many companies blindly throw unencrypted credentials around like this.

    • Tell me about it. Fortunately where I work is sane and reasonable about it.

      We have a password sheet. You have to be on the VPN (login/password). Then you can log in: login/password (different from the above), plus a second password + OTP. Then a password-sheet password.

      I'm still rooting out passwords from our repo, with goobers putting creds in source code (yeah, not even config files....grrrrr). But I attack them as I find them. I've only found one root password for a DB in there... and thankfully it's been changed!

      3 replies →

    • Slack getting hacked would definitely be a mess. There are going to be so many cloud credentials, passwords, keys, customer info...

    • The exact same Slack that he remained in for several hours after being fired. An even worse move, practically inviting a response from a disgruntled employee...

Plot twist: the CTO or senior staff needed to cover something up (maybe a previous loss of critical business data) and arranged for this travesty to happen, provided enough junior devs were run through that mockery of a "local db setup guide".

Either that, or this is a "worst fuckup on the first day of the job" fantasy piece - I refuse to acknowledge living in a world where the alternatives have any meaningful non-zero probability of occurring.

  • There are no upper bounds on incompetence. I've seen enough WTFs even in companies that didn't seem particularly dysfunctional, and that had some very competent people.

    And then it takes only one shitty manager, or manager in a bad mood, to get the innocent junior dev fired.

People will screw up, so you have to do simple things to make screwing up hard. The production credentials should never have been in the document. Letting a junior have prod-level access is not that far out of the norm in a small startup environment, but don't make prod credentials part of the setup guide. Sounds like they also have backup issues, which points to overall poor devops knowledge.

Not part of this story, but another pet peeve of mine is when I see scripts checking strings like "if env = 'test' else <runs against prod>". This sets up another WTF situation: if someone typos 'test', the script hits prod.
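
A cheap fix is to make the dangerous branch opt-in instead of the fallback; a hypothetical sketch (the env var names are made up):

    # Sketch: fail closed on unknown environments instead of falling through to prod.
    import os
    import sys

    KNOWN_ENVS = {"test", "staging", "prod"}

    env = os.environ.get("APP_ENV", "")
    if env not in KNOWN_ENVS:
        sys.exit(f"Unknown APP_ENV {env!r}; refusing to run.")  # a typo now stops the script
    if env == "prod" and os.environ.get("I_REALLY_MEAN_PROD") != "yes":
        sys.exit("Refusing to touch prod without I_REALLY_MEAN_PROD=yes.")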

  • Heh, or take a Netflix Chaos Monkey approach and have a new employee attempt to take down the whole system on their first day and fire any engineers who built whatever the new employee is actually able to break!

    • Why fire them? It's valuable experience that you are paying a lot for them to gain. Better: hold a postmortem, figure out what broke, and make the people who screwed it up originally fix it. Keep people who screw things up, as long as they also fix it.

      1 reply →

  • > so you have to do simple things to make screwing up hard

    No one goes out of their way to screw up; I'd recommend making it easier to recognize when you've made a mistake, and recover from it.

    Except for critical business stuff, that needs severe "you cannot fuck this up" safeguards.

Yeah, another case of "blame the person" instead of "blame the lack of systems". A while back, there was a thread here on how Amazon handled their S3 outage, caused by a devops typo. They didn't blame the devops guy, and instead beefed up their tooling.

I wonder whether that single difference, blame the person vs. fix the system/tools, predicts the failure or success of an enterprise.

  • I think it's a major predictor for how pleasant it is for anyone to work at the company, and thus a long term morale and hiring issue.

    This is the sort of situation that makes for a great conference talk on how companies react to disaster, and how the lessons learned can either set the company up for robust dependable systems or a series of cascading failures.

    Unfortunately, the original junior dev was living the failure case. Fortunately, he has learned early in his career that he doesn't want to work for a company that blames the messenger.

Assuming the details are correct, this should be considered a win by the junior dev. It only took a day to realize that this is a company he really, really doesn't want to try to learn his profession at.

  • He should get that laptop back to them IMMEDIATELY. These sound like exactly the sort of douches who would try to charge him with theft. (Edit: Why is it not surprising they don't have a protocol in place for dismissing staff and, like, getting their stuff back?)

    • Well, the customer database with important data just got nuked, so even if there is a protocol, the people who would normally do the steps have other things in mind. The laptop and such is the least of their concerns.

  • Nobody hires you if things are perfect. They hire you because there's a problem. It might be a startup or a company just getting started in tech. Either way, they are in their infancy.

    • This isn't imperfection. This is beyond incompetence into some sort of Dunning-Kruger zen state. The story describes failures so egregious that the principals have no business taking money from customers.

> The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss.

I don't know if I should laugh or cry here.

Guaranteed the CTO is busily rewriting the developer guide and excising all production DB credentials from the docs so that he can pretend they were never there. While the new guy's mistake was unfortunate in a very small way, the errors made by the CTO and his team were unfortunate in a very big way. The vague threat of legal action is laughable, and the reaction of firing the junior dev who stumbled into their swamp of incompetence on his first day speaks volumes about the quality of the organization and the people who run it. My advice... learn something from the mistake, but otherwise walk away from that joint and never look back. It was a lucky thing that you found out what a mess they are on day 1.

Several years back I worked as a DBA at a managed database services company, and something very similar happened to one of our customers who ran a fairly successful startup. When we first onboarded them I strongly recommended that the first thing we do is get their DB backups happening on a fixed schedule, rather than an ad-hoc basis, as their last backup was several months old. The CEO shuts me down, and instead insists that we focus on finding a subtle bug (you can't nest transactions in MySQL) in one of their massive stored procedures.

It turns out their production and QA database instances shared the same credentials, and one day somebody pointed a script that initializes the QA instances (truncate all tables, insert some dummy data) at the production master. Those TRUNCATE TABLE statements replicated to all their DB replicas, and within a few minutes their entire production DB cluster was completely hosed.

Their data thankfully still existed inside the InnoDB files on disk, but all the organizational metadata was gone. I spent a week of 12-hour days working with folks from Percona to recover the data from the ibdata files. The old backup was of no use to us since it was several months old, but it was helpful in that it provided us a mapping of the old table names to their InnoDB tablespace ids, a mapping destroyed by the TRUNCATE TABLE statements.

No disrespect to the OP, but this sounds pretty fake. If the database in question was important enough to fire someone over immediately, then the creds wouldn't have been floating around in an onboarding PDF. And involving legal? Has anyone here heard of anything similar? I'm just 1 datapoint, but I know I haven't.

  • Yeah, I thought it sounded fake as well. I mean things like this happen, but something about the story just doesn't ring true to me.

  • Realised the user account is 3 weeks old, which is a red flag for me since it has no posts and the events allegedly happened Friday.

It's not the CTO's fault. It's the document's fault! We should never have documentation again, this is what it has done to us! We need to revert to tribal knowledge to protect ourselves. If we didn't document these values, people wouldn't be pasting them in places they shouldn't be!

/s

For some years now I've stopped bothering with database passwords. If technically required, I just make them the same as the username (or the database name, or all three the same if possible). Why? Because the security offered by such passwords is invariably a fiction in practice; I've never seen an org where they couldn't be dug out of docs or a wiki or test code. Instead, database access should be enforced by network architecture: the production database can only be accessed by the production applications, running in the production LAN/VPC. With this setup, no amount of accidental (or malicious) commands run by anyone from their local machine (or any other non-production environment) could possibly damage the production data.
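
On AWS that policy comes down to a single ingress rule; a boto3 sketch (group IDs and the port are placeholders):

    # Sketch: the prod DB security group only accepts traffic from the prod app security group.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId="sg-0prodDBplaceholder",            # placeholder: the DB's security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5432,                       # placeholder: PostgreSQL; 3306 for MySQL
            "ToPort": 5432,
            # reference the app tier's group rather than any CIDR, so laptops and VPNs simply can't connect
            "UserIdGroupPairs": [{"GroupId": "sg-0prodAPPplaceholder"}],
        }],
    )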

Side question: as a dev with zero previous ops experience, now the solo techie for a small company and learning ops on the fly, we're obviously in the situation where "all devs have direct, easy access to prod", since I'm the only dev. What steps should I take before bringing on a junior dev?

  • * Local env setup docs should have no production creds in them (EDIT: production creds should always be encrypted at rest)

    * new dev should only have full access to local and dev envs, no write access to prod

    * you're backing up all of your databases, right? Nightly is mandatory, hourly is better

    * if you don't have a DBA, use RDS (a quick check of its backup settings is sketched below)

    That'll prevent the majority of weapons-grade mistakes.

    Source: 15 years in ops
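
    As a quick sanity check on the RDS point, you can confirm automated backups are actually enabled (a boto3 sketch, assuming AWS credentials for the prod account):

        # Sketch: flag any RDS instance with automated backups turned off.
        import boto3

        rds = boto3.client("rds")
        for db in rds.describe_db_instances()["DBInstances"]:
            retention = db["BackupRetentionPeriod"]  # 0 means automated backups are disabled
            status = "OK" if retention > 0 else "NO AUTOMATED BACKUPS"
            print(f'{db["DBInstanceIdentifier"]}: retention={retention} days ({status})')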

  • Do the best you can to "find compute room" (laptop, desktop, spare servers on the rack that aren't being used, ... cloud), and make a Stage.

    Make changes to Stage after doing a "Change Management" process (effectively, document every change you plan to make, so that an average person typing it out would succeed). Test these changes. It's nicer if you have a test suite, but you won't at first.

    Once testing is done and considered good, make the changes on prod in accordance with the CM. Make sure everything has a back-out procedure, even if it is "drive to get backups, and restore". But most of these should be "copy the config to /root/configs/$service/$date, then proceed to edit the live config" (a tiny helper for that is sketched below). Backing out would entail restoring the backed-up config.

    ________________________

    Edit: As an addendum, many places this small usually have insufficient, non-existent, or Schrödinger backups. Having a healthy, living stage environment does 2 things:

    1. You can stage changes so you don't get caught with your pants down on prod, and

    2. It is a hot-swap for prod in the case Prod catches fire.

    In all likelihood, "all" of prod wouldn't DIAF, but maybe the machine that houses the DB has power issues with its PSUs and fries the motherboard. You at least have a hot machine, even if it holds stale data from yesterday's imported snapshot.
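
    For the copy-the-config-first step, even a tiny helper beats doing it by hand; a sketch following the /root/configs/$service/$date convention above:

        # Sketch: snapshot a config file into /root/configs/<service>/<date> before editing it.
        import shutil
        from datetime import date
        from pathlib import Path

        def backup_config(service: str, config_path: str) -> Path:
            dest_dir = Path("/root/configs") / service / date.today().isoformat()
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / Path(config_path).name
            shutil.copy2(config_path, dest)   # backing out = copying this file back over the live one
            return dest

        backup_config("nginx", "/etc/nginx/nginx.conf")  # placeholder example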

    • You missed one of the really nice points of having a stage there. You use it to test your backups by restoring from live every night/week. By doing that, you discourage developing on staging and you know for sure you have working backups!

      1 reply →

  • As I said in another post, the least you can do is modify your hosts file so you can't access the production database from your local computer. Then you have to log in to a remote computer to access production.

  • As advised elsewhere, before you have a DBA you should consider buying a hosted service like RDS, which would provide, at a minimum, backups and restore points. Even have separate dev and prod accounts on RDS.

    • before you have a DBA

      You never don't have a DBA. If you don't know who it is, it's you! But there will always, always be someone who is held responsible for the security, integrity, and availability of the company's assets.

  • The best rule of thumb whenever you're doing work as a solo dev/ops guy is to always think in terms of being two people: the normal you (with super user privs etc.) and the "junior dev/ops" you who just started his first day. Whatever you're working on needs to support both variants of you (with appropriate safeguards, checks and balances in place for junior you).

    E.g. when deciding how to back up your prod database, if you're thinking as both "personas" you'll come up with a strategy that safely backs up the database but also makes it easy for a non-privileged user to securely suck down an (optionally sanitised) version of the latest snapshot for seeding their dev environment [and then dog-food this by using the process to seed your own dev environment].

    Some other quick & easy things:

    - Design your terraform/ansible/whatever scripts such that anything touching sensitive parts needs out-of-band ssh keys or credentials. E.g. if you have a terraform script that brings up your prod environment on AWS, make sure that the AWS credentials file it needs isn't auto-provisioned alongside the script. Instead, write down on a wiki somewhere that the team member (at the moment, you) who has authority to run that terraform script needs to manually drop his AWS credentials file in directory /x/y/z before running the script (a sketch of this guard follows the list). The same goes for ansible: control and limit which ssh keys can log in to different machines (don't use a single "devops" key that everyone shares and imports into their keychains!). Think about what you'll need to do if a particular key gets compromised or a person leaves the team.

    - Make sure your backups are working, taken regularly, stored in multiple places and encrypted before they leave the box being backed up. Borgbackup and rsync.net are a nice, easy solution for this.

    - Make sure you test your backups!

    - Don't check passwords/credentials in to source code without first encrypting them.

    - Use sane, safe defaults in all scripts. Like another poster mentioned, don't do if env != "test"; do prod_stuff();

    - RTFM and spend the extra 20 minutes to set things up correctly and securely rather than walking away the second you've got something "working" (thinking 'I'll come back later to tidy this up' - you never will).

    - Follow the usual security guidelines: firewall machines (internally and externally), limit privileges, keep packages up to date, layer your defences, use a bastion machine for access to your hosted infrastructure

    - Get in the habit of documenting shit so you can quickly put together a straightforward onboarding/ops wiki in a few days if you suddenly do hire a junior dev (or just to help yourself when you're scratching your head 6 months later wondering why you did something a certain way)
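
    For the out-of-band credentials point above, the guard can be as dumb as a file check; a sketch (the /x/y/z path mirrors the wiki convention mentioned earlier and is otherwise hypothetical):

        # Sketch: refuse to run the prod provisioning script unless the operator has manually
        # dropped their own AWS credentials file in place first (path is hypothetical).
        import os
        import sys

        CREDS_PATH = "/x/y/z/aws_credentials"   # documented on the wiki, never auto-provisioned

        if not os.path.isfile(CREDS_PATH):
            sys.exit(f"No credentials at {CREDS_PATH}; drop your own file there before running this.")

        os.environ["AWS_SHARED_CREDENTIALS_FILE"] = CREDS_PATH  # honoured by the AWS SDKs and CLI
        # ...hand off to the actual provisioning run from here...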

The author should get their own legal in line - does the contract even allow termination on the spot? If not, the employer is just adding to their own pile of ridiculous mistakes.

  • Probably. At-will employment is pretty common in the US.

    • Even in Europe it's pretty lenient for the first period. Different countries obviously have different maximum probation periods, but day 1 would fall within it in most (all?) of them.

    • Are we sure it's the USA? Some of the language used in the poor guy's post on Reddit implies a non-native speaker. I am guessing India, which is known for treating employees "horrifically".

One of the questions I asked my manager during the interview process was how he felt about mistakes.

I knew I was being brought in to rearchitect the entire development process for an IT department and that I would make architectural mistakes no matter how careful I was and that I would probably make mistakes that would have to be explained to CxOs.

Whatever the answer he gave me, I remember being satisfied with it.

Reminds me of my first dev job, when I got a call during lunch:

"The server has been down all day, and you are the only one who hasn't noticed. What did you break?"

"Well, I saw that all the files were moved to `/var/www/`, and figured it was on purpose."

Suffice it to say, I got that business to go from Filezilla as root to BitBucket with git and some update scripts.

Something tells me their production password was nothing like a 20-char random string...

Am I the only one who is surprised that he could get the keys to the kingdom on day 1?

Day 1 is when you set up your desk and get your login. Then go back to HR to do the last of the hiring paperwork.

It should take a good week before a new employee is able to fuck up anything. Really.

  • How long do you need to adjust the height of your chair? Setting up the dev environment often takes ages. Why wouldn't it be the first thing to do? There will be enough progress bars while updating something like Visual Studio to find time to re-adjust the chair.

I did the same thing early on in my career. Shut down several major ski resorts in Sweden for an entire day during booking season by doing what we always did: running untested code in production to put out fires. Luckily, my company and our customers took that as a cue to tighten up the procedures instead of finding someone to blame. I hear this is how it works in aviation as well: no one gets blamed for mistakes, since blame only prevents them from being dealt with properly. Most of us are humans, and humans make mistakes. The goal is to minimize the risk of mistakes.

I stopped believing reddit posts a long time ago

  • Exactly, the post is very clichéd. I have about 75% belief that it's fictional. I guess it could be sort of entertaining to see how easy it is to get a few hundred software engineers on reddit and hacker news worked up into a sympathetic and self-righteous frenzy with a simple and entirely fictional paragraph posted for free from a throwaway account.

    • I am about 101% sure it's fake. "Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea)" - yeah, no. Had you told me you were able to screw the production db up because it had no su password set, you might have got me. But this is bullshit.

Technical infrastructure is often the ultimate in hostile work environments. Every edge is sharp, and great fire-breathing dragons hide in the most innocuous of places. If it's your shop, then you are going to have a basic understanding of the safety pitfalls, but you're going to have no clue as to the true severity of the situation.

If you introduce a junior dev into this environment, then it's he who is going to discover those pitfalls, in the most catastrophic ways possible. But even experienced developers can blunder into pitfalls. At least twice I've accidentally deployed to production, or otherwise run a powerful command intended for a development environment against production.

Each time, I carefully analyzed the steps that led up to running that command and implemented safety checks to keep it from happening again. I put all of my configuration into a single environment file so I can see at a glance the state of my shop. I make little tweaks to the project all the time to maintain this, which can be difficult because the three devs on the project work in slightly different ways and the codebase has to be able to accommodate all of us.

While this is all well and good, my project has a positively decadent level of funding. I can lavish all the time I want in making my shop nice and pretty and safe.

A growing business concern cannot afford to hire junior devs fresh out of code school / college. That's the real problem here. Not the CTO's incompetence; any new-ish CTO in a startup is going to be incompetent.

The startup simply hired too fast.

  • The same thing could happen to a senior, in particular a tired, overworked senior. It is more likely to happen to a junior, because a junior is likely to be overwhelmed. However, mistakes like this happen to people of all ages and experience levels.

    Seniority is what makes you not put the damn password into the setup document; that was the inexperienced-level mistake. Forgetting to replace it while you are setting up a day-one machine is a mistake that can happen to anyone.

    • True, but a senior engineer, even if he is never able to make architecture decisions, can still be held accountable for knowing better. That is precisely why they are paying him the big bucks.

      If a shop is being held together with duct tape and elbow grease, then you should have known that going in, and developed personal habits to avoid this sort of thing. Being overworked and tired isn't an excuse. Sure, the company and investors have to bear the real consequences, but you as an IC can't disclaim responsibility.

This company has a completely different problem: no separation of duties. Start with talking to the CTO how this could have happened in the first place, re-hire the junior dev.

After all, if the junior dev could do it, so can everybody else (and whoever manages to get their account credentials).

When it comes to backups, there are two types of people: those who do backups, and those who will.

This is purely the fault of the entire leadership stack.

From Sr dev/lead dev, dev manager, architect, ops stack, all the directors, A/S/VPs, and finally the CTO. You could even blame the CEO for not knowing how to manage or qualify a CTO. Even more embarrassing is if your company is a tech company.

I think proper due diligence would find the fault within the existing company.

It is not secure to give production access and passwords to a junior dev. And if you do, you put controls in place. I think that if there is insurance in place, some of its requirements would have to be reasonable access controls.

This company might find itself sued by customers for their prior and obviously premeditated negligence from lack of access controls (the doc, the fact they told you 'how' to handle the doc).

  • The Junior dev does bear a small amount of blame, if you really want to go the blameful route.

    But figuring out who to blame is toxic. You've got to go for a blameless culture and instead focus on post mortems and following new and better processes.

    Things can absolutely always go to shit no matter where you work or how stupidly they went to shit. What differentiates good companies from bad ones is whether they try to maximize the learning from the incident or not.

Ahhhhh haaaa yeah.....I've done that.

It was the second day, and I only wiped out a column from a table, but it was enough to bring business for several hundred people down for a few hours. It was embarrassing all round really. Live and learn though - at least I didn't get fired!

Obviously this is mostly the CTO's screw-up.

But the junior dev is not fully innocent either: he should have been careful about following instructions.

For extra points (to prove that he is a good developer) he should have caught the screw-up with the username/password in the instructions. Here's the approximate line of reasoning:

---

What is that username in the instructions responsible for? The production environment? Then what would happen if I actually ran my setup script against the production environment? The production database would be wiped? Shouldn't we update the setup instructions and some other practices to make sure that can't happen by accident?

---

But it is very unlikely that this junior dev would be held legally responsible for the screw-up.

I destroyed an accounting database at a company during a high school summer job.

A mentor was supervising me and continually told me to work slower, but I was doing great performing some minor maintenance on a Clipper application and didn't even need his "stupid" help ... until I typed 'del *.db' instead of 'del *.bak'. Oooops!

Luckily the woman whose computer we were working on clicked 'Backup my data' every single day before going home, bless her heart, and we could copy the database back from a backup folder. A 16 year old me was left utterly embarrassed and cured of flaunting his 1337 skillz.

Obviously not the new engineer's fault. Unfortunately, aspects of this are incredibly common. At three jobs I've had, I had full production access on day one. By that, I mean everyone had it...

After adding up the number of egregious errors made by the company, I'd almost be inclined to say the employee has grounds for wrongful termination, or at least fraudulent representation to recoup moving expenses.

He/she is better off not working at this place. So many things wrong. Not having a backup is the number 1 thing.

I could see having a backup that is hours old, and losing many hours of data, but not everything.

Even startups have contracts with their customers about protecting the customer's data. If it is consumer data, there are even stricter privacy laws. Leaving the production database password lying around in plain text is probably explicitly prohibited by the contracts, and certainly by privacy laws. The CTO should pay him for the rest of the year and give him a great reference for his next job, in return for him to never, ever, ever, tell anyone where he found the production password.

Here's why I think this is fake:

A company with 40 devs and over 100 employees that lost an entire production db would have surfaced here from the downtime. Other devs would corroborate the story.

  • I'm also skeptical, but this isn't necessarily true. There's plenty of software being written outside the HN bubble that's totally invisible to us. What if this was some shipping logistics company in Texas City? We'd never know about it; they wouldn't have a trendy dev blog on Medium.

I always wonder why IT companies don't test their backups. Even if it's the prod db, it should be tested on a regular basis. No blame to the dev.

We were paying for RDS right from when we were a two-man startup. There's zero reason not to have a DB service that is backed up frequently by a competent team.

He needs to return the laptop asap, like now. They are in full emotional mode and can overreact to what they might perceive as another bad act too.

I don't work in tech but I'm an avid HN reader.

I'm surprised a junior dev on his first day isn't buddied up with an existing team member.

In my line of work, an existing employee who transferred from another location would probably be thrown in at the deep end, but someone who is new would spend some time working alongside someone who is friendly and knowledgeable. This seems the decent thing to do as humans.

Yeah, this infra/config management sounds like land-mine / time-bomb incompetence territory. You were just the unlucky one to trigger it. Luckily this gives you an opportunity to work elsewhere and hopefully be in a better place to learn some good practices - which is really what you're after as a junior dev anyway.

Lucky junior dev! He has figured out a bad company to work for in his first work day. Good luck finding a new job!

  • Also, this is going to look great on their resumé, and be the perfect response to the "tell us a time when you made a mistake" interview question.

Everybody agrees that the instructions shouldn't have even had credentials for the production database, and the lion's share of the blame goes to whoever was responsible for that.

There is still a valuable lesson for the developer here though - double check everything, and don't screw up. Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster - meaning you need to follow plans and instructions precisely.

Setting up your development environment on your first day shouldn't be one of those times, but those times do exist. Over the course of a job or career at a stable company, it's generally not the "rockstar" developers and risk-takers that get ahead, it's the slow and steady people who take the extra time and never mess up.

Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

  • No, sorry, and it's important to address this line of thinking because it goes strongly against what our top engineering cultures have learned about building robust systems.

    > Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster

    These times should be extremely rare, and even in this case, they should've had backups that worked. The idea is to reduce the ability of anyone to destroy the system, not to "just be extra careful when doing something that could destroy the system."

    > Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

    Which tells me that this company will have issues again. Look at any high-functioning, high-risk environment and look at the way they handle accidents, especially in manufacturing. You need to look at the overarching system that enabled this behavior, not isolate it down to the single person who happened to be the guy who made the mistake today. If someone has a long track record of constantly fucking up, yeah, sure, maybe it's time for them to move on, but it's very easy to see how anyone could make this mistake, and so the solution needs to be to fix the system, not the individual.

    In fact, I'd even thank the individual in this case for pointing out a disastrous flaw in the processes today rather than tomorrow, when it would be one more day's worth of expense to fix.

    Take a look at this: https://codeascraft.com/2012/05/22/blameless-postmortems/

    • I violently agree with you.

      All I'm saying is that there are times when it is vital to get things right. Maybe it's only once every 5 or 10 years in a DR scenario, but those times do exist. Definitely this company is incompetent, deserves to go out of business, and the developer did himself a favor by not working there long-term, although the mechanism wasn't ideal.

      I'm just saying that the blame is about 99.9% with the company, and 0.1% for the developer - there is still a lesson here for the developer - i.e., take care when executing instructions, and don't rely on other people to have gotten everything right and to have made it impossible for you to mess up. I don't see it as 100% and 0%, and arguing that the developer is 0% responsible denies them a learning opportunity.

      2 replies →

  • While working on AWS, we had data corruption caused by a new feature launch. Deployments took ~6 weeks so the solution was to use GDB to flip a feature flag in memory for about 120k servers.

  • > There is still a valuable lesson for the developer here though - double check everything, and don't screw up.

    "Double check everything" is a good lesson, because we all can and should practice it.

    "Don't screw up" is not useful advice because it's impossible. There's a reason we don't work like that... Who needs backups? Just don't screw anything up! Staging environment? Bah, just don't screw up deployments! Restricted root access? Nah, just never type the wrong command. No, we need systems that mitigate humans screwing up, because it will happen.

  • > the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.

    I think that they simply acted emotionally and out of fear, anger, and stress. The vague legal threat and otherwise ignoring this dude both suggest it. The way events unfolded, it does not sound like much rational thinking was involved.

Cool story, but I think this is fake. Since there are 40 people in the company, it seems like at least a few people before him followed the onboarding instructions. I just don't believe there would be that many people who a) didn't do the same thing he did and b) didn't change the document.

Repeat after me, while clicking your heels together three times: "It is not my fault. It is not my fault. It is not my fault." It was obvious as I read your account that you would be fired. A company that allowed this scenario to unfold would not understand that it was their fault.

I was only granted read-only access to the Prod DB last week, after achieving 6 months of seniority.

I would assume this was mocked up to test whether the intern could follow simple instructions, to provide a lecture on the huge consequences of small mistakes, and to have a viable reason to fire him afterwards; but I'm wearing my tinfoil hat right now, too.

It is really unfair to have fired him. The OP is not the one who should have been fired. The guy in charge of the db should be fired, and the manager who fired the OP should be fired too. And, by the way, the guy in charge of the backups too.

I would suggest that, once this is sorted out, you publicly mention the company name so no other engineer will fall into this trap. It would be a lesson for them to properly follow basic practices for data storage.

Unfortunately, software companies like that are everywhere. The guy is learning and screwed up a terribly designed system; the blame is on the "senior" engineers who set up that environment.

My question is, why in the world did they publish someone's production credentials in an onboarding document? That has to be a SOX compliance violation at the very least.

The CTO should be fired immediately!

If I didn't read it wrong, they put production db credentials in the first-day local dev env instructions! WTF.

This CTO sounds to me even worse than this junior developer.

So a script practically set up the machine with the nuclear football by default, and then you were expected to defuse it before using it. That is not your fault.

I have a feeling the CTO was actually one of those "I just graduated bootcamp and started a website, so I can inflate my title 10x" types.

Should have job title changed to Junior Penetration Tester and be rewarded for exposing an outfit of highly questionable competence.

Firing the guy seems drastic but understandable. Implying that they are going to take legal action against him is ridiculous!

So the company's fault. Embarrassing they tried to blame the new guy. So many things wrong with this.

Wow. What a train wreck. This is why the documentation I write contains database URIs like:

USER@HOST:PORT/SCHEMA

It was their fault, plain and simple.

  • How is it the FNG's fault that they have no backups, no DR plan, and production DB details freely available in the setup guide? The company is entirely at fault.

  • Did you read the details? I disagree. The dev is probably better off not working in such a volatile environment. They'll be better off working somewhere they can learn some best practices, possibly somewhere that doesn't have the possibility of wiping a production DB because you ran some tests from a developer's machine. That's inexcusable.

  • > It was their fault, plain and simple.

    <off-topic>A wonderful example of the shortcoming of the singular "they"... :P

I really suggest that the OP send this thread to HR and others. And this isn't sarcasm.

Either the CTO and his dev team are ridiculously stupid, or this was on purpose.

Lots of people in the thread are commenting how surprised they are that a junior dev has access to production db. Both jobs I've had since graduating gave me more or less complete access to production systems from day one. I think in startup land - where devops takes a back seat to product - it's probably very common.

  • I work for a large bank as an ops engineer. The idea that I could even read a production database without password approval from someone else is too crazy to consider. Updating or deleting takes a small committee and a sizeable "paper trail" to approve.

    Sometimes when I read stories like these, I think it's no wonder a company like WhatsApp can have a billion customers with less than 100 employees. And then I make some backups to get that cozy safe feeling again.

  • Which doesn't absolve the CTO of the responsibility to have practices in place that could have prevented this. While I'm not going to hold my breath over the threatened legal claims, being fired for something that any person in the building could have done doesn't sound like a conducive environment to work in.

  • > I think in startup land - where devops takes a back seat to product - it's probably very common.

    It is, but it need not be. It's pretty easy to set up at least some backup system in such a way that whoever can wipe the production systems can't also wipe the backup.

  • It's a pretty gross error to me to have direct db access. Obviously in any stack you could push code that affects the db catastrophically anyway, but in dev mode you should never connect directly to the production database, not only because of this kind of error but for general data integrity.

    The CTO needs to put on some theatre to not get fired, because he is ultimately responsible.

  • As a junior noc analyst at a Fortune 500 company, I had root access to almost the entire corporate infrastructure from day one. Databases, front ends, provisioning tools, everything.

    • It's not about having access to the production database. It's about having an example script that can do catastrophic things and having the production username and password in the example docs.

      I also had production access to everything from day one. The first thing I did was set up the hosts files on the various dev servers, including my own computer, so I couldn't access the databases from them.

      I have to remote into another computer to access them.

  • In that case, don't you think they should have informed him that this was a production environment, so don't fuck it up? Giving your junior dev a front seat on your production systems without proper communication is a disaster waiting to happen.

  • I think in startup land - where devops takes a back seat to product - it's probably very common

    Perhaps with hipster databases like MongoDB that are insecure out of the box, but most grown-up, sensible DBs have the concept of read-only users, and also it is trivial to set up such that you can e.g. DELETE data in a table without being able to DROP that table.

    I'll wager any startup that does what you describe has devs who do SELECT * FROM TABLE; on a 20-col, million+ row table when they actually want one value from one row, and then filter in their own code... Yes, I have seen this more times than I care to count.
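
    For reference, that privilege split is only a couple of statements; a MySQL sketch run through pymysql (user names, passwords, and the schema name are placeholders; note that TRUNCATE requires the DROP privilege, so withholding DROP covers both):

        # Sketch: separate read-only and app users on MySQL (placeholders throughout).
        import pymysql

        admin = pymysql.connect(host="db.prod.private", user="admin", password="***", autocommit=True)
        with admin.cursor() as cur:
            # Reporting/dev user: can read, can't change anything.
            cur.execute("CREATE USER 'readonly'@'%' IDENTIFIED BY 'long-random-password-1'")
            cur.execute("GRANT SELECT ON app_db.* TO 'readonly'@'%'")
            # Application user: can modify rows, but without DROP it can't DROP or TRUNCATE tables.
            cur.execute("CREATE USER 'app'@'%' IDENTIFIED BY 'long-random-password-2'")
            cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app'@'%'")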

  • I agree.

    But not in the commands to run in a local-dev-setup-guide that purges the db it points to.

    If anyone should be fired for that, it's the CTO. He must suck at his job at the very least, and the junior dev should get an apology.

  • > I think in startup land - where devops takes a back seat to product - it's probably very common.

    Not focusing on devops and putting your prod db credentials in plain sight are VERY different things. It's really, really easy to get right, especially if you are using Heroku or something like that. The same goes for database backups. I worked at multiple startups (YC and non), and they all had the basics nailed down, even when they were just 3-5 employees.

  • The problem isn't so much that a junior had access to the production Db. The problem is that the junior's dev setup had access to the production Db and could nuke the whole thing with a few misplaced keystrokes. I'm working on a product currently where I am the only dev. I have a pretty large production Db. I also have a smaller clone of that same Db on my local machine for development purposes. I can only access the production Db by directly shelling into the machine it's running on or by performing management commands on one of the production worker machines (which I also need to shell into). This was not very difficult to set up and ensures that my development environment cannot in any way affect the production environment.

    Also, why even distribute the production credentials at all? Only the most senior DBAs or devs should have access to production credentials.

  • I've done about ~40 or so technology due diligence projects for investors of tech companies. You'd be amazed how many security flaws there are out there. One of the most simple ones - storing production credentials in the git repo.

  • Sure, there exist reasons to do that. It's still a bad idea, but, ok.

    But there is no reason to write the production DB credentials in a document, especially as an "example". That is monumentally dumb. It amounts to asking for this to happen.

  • We give everyone access to production systems, but even if someone deleted everything from production, we can restore everything in ~20 minutes (this has happened), and if that process fails, we have backups on s3 that can be restored in a couple of hours (and this is tested regularly, but thankfully hasn't happened yet), and even if that fails...

    There's a reason why it's called disaster recovery and prevention.

    • Why try to justify stupid behavior and absent security controls with the idea that your downtime is "only" 20 minutes? How silly.

  • I've only had one job after college, and I am still there almost a year after being hired. For the first few months I only had access to my own local copy of the production DB. There are a reason or two why I wouldn't be outright trusted, one of them stemming from me being a junior.

  • Wow, really?!? I've never granted access to prod databases on day one, junior or not.

    I thought that was just SOP.

    • Can you send me your password again, I forgot it. Also, please reply to those emails from emily - or just delete them... they are cluttering up your inbox, and I am tired of having to sort through your guys' personal crap.

      Thanks

  • Same. I had instructions from a competent developer however. I would still blame whoever allowed production access as part of application setup, as well as the fact there isn't a process to back up this production data.

  • It's not that this shouldn't happen, but that it does happen and has to be dealt with as the potential impact scales up. Having production creds on day 1 isn't the same as day 500.

  • Perhaps, but in 2017 it is gross fiduciary malpractice not to have backup systems in place for production data and code. It would be grounds for a shareholder suit against the principals.

  • Common does not mean it's correct, or that it has to be that way. Trump thinks climate change is a hoax; that does not make it convincing...

Name and shame. The CTO stinks of incompetence, and I'm surprised he/she has managed to retain any competent staff (perhaps he/she actually hasn't). What a douchebag. You are not to blame.

I worked with someone who did this, early in my career. His bacon was saved by the fact that a backup had happened very soon before his mistake.

His was worse though, because he had specifically written a script to nuke all the data in the DB, intending it for test DBs of course. But after all that work, he was careless and ran it against the live DB.

It was actually kind of enlightening to watch, because he was considered the "genius" or whatever of my cohort. To wit, there are different kinds of intelligence.

I fucked up a table once by setting a column of every record to true, but I had asked about changing the code to require a manual SQL query a few weeks prior, so it could have been prevented.

This story can't be true. If it is, obviously the junior dev is much better off working elsewhere. Why is this company in business and hiring people?

People are designed to make mistakes. We should learn from them and try to be more understanding.

Small shops aren't always perfect. The engineering team should not allow junior devs to hit the DB directly. If you are that vulnerable to coding mistakes, you shouldn't be hiring junior devs.

Understandable that he got fired. I imagine there was quite an emotional response from the business when this happened, but that doesn't mean it was necessarily the most appropriate response.

--Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea).--

Someone else should have been fired if this is true.

  • No, it's inappropriate. When the process fails you don't fire the junior employee that showed you just how incompetent the organization is. You fix the problem.

    Firing this guy does nothing, fixing the problem does, but requires those higher up to admit the mistake was theirs to begin with.

    • I should clarify: I don't think it's appropriate, but I see why it happened. The business was panicking, and his firing was kind of on the cards. Probably a blessing in disguise for this guy.