Comment by tacostakohashi
8 years ago
Everybody agrees that the instructions shouldn't have even had credentials for the production database, and the lion's share of the blame goes to whoever was responsible for that.
There is still a valuable lesson for the developer here though - double check everything, and don't screw up. Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster - meaning you need to follow plans and instructions precisely.
Setting up your development environment on your first day shouldn't be one of those times, but those times do exist. Over the course of a job or career at a stable company, it's generally not the "rockstar" developers and risk-takers that get ahead, it's the slow and steady people who take the extra time and never mess up.
Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.
No, sorry, and it's important to address this line of thinking because it goes strongly against what our top engineering cultures have learned about building robust systems.
> Over the course of a programming career, there will be times when you're operating directly on a production environment, and one misstep can spell disaster
These times should be extremely rare, and even in this case, they should've had backups that worked. The idea is to reduce the ability of anyone to destroy the system, not to "just be extra careful when doing something that could destroy the system."
> Although firing this guy seems really harsh, especially as he had just moved and everything, the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.
Which tells me that this company will have issues again. Look at any high-functioning, high-risk environment, especially in manufacturing, and see how they handle accidents. You need to look at the overarching system that enabled this behavior, not isolate it down to the one person who happened to be the guy who made the mistake today. If someone has a long track record of constantly fucking up, yeah sure, maybe it's time for them to move on, but it's very easy to see how anyone could make this mistake, so the solution needs to be to fix the system, not the individual.
In fact, I'd even thank the individual in this case for pointing out a disastrous flaw in the processes today rather than tomorrow, when it would have been one day more expensive to fix.
Take a look at this: https://codeascraft.com/2012/05/22/blameless-postmortems/
I violently agree with you.
All I'm saying is that there are times when it is vital to get things right. Maybe it's only once every 5 or 10 years in a DR scenario, but those times do exist. This company is definitely incompetent, deserves to go out of business, and the developer did himself a favor by not working there long-term, although the mechanism wasn't ideal.
I'm just saying that the blame is about 99.9% with the company and 0.1% with the developer. There is still a lesson here for the developer: take care when executing instructions, and don't rely on other people to have gotten everything right or to have made it impossible for you to mess up. I don't see it as 100% and 0%, and arguing that the developer is 0% responsible denies them a learning opportunity.
Well, sure... but you can't expect someone transitioning from intern status to first-real-job status to have the forethought of a 20-year veteran. Nor should that intern/employee expect that the company that is ostensibly supposed to mentor him at the very beginning of his career would have such a poor security stance as to put literal prod creds in an on-boarding document, let alone fail to relegate whatever he was on-boarding with to a sandbox with absolutely no access to anything.
Not to be pedantic, but the fact that you are literally assigning percentages of blame to entities means you do not, in fact, violently agree with me. Read the article I posted and you'll see why it is so important not to assign blame at all.
While working on AWS, we had data corruption caused by a new feature launch. Deployments took ~6 weeks so the solution was to use GDB to flip a feature flag in memory for about 120k servers.
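For anyone who hasn't done that kind of live surgery, here's a rough sketch of what it can look like. Everything here is hypothetical (the flag symbol `g_new_feature_enabled`, the process name `myservice`, the `hosts.txt` inventory); the real AWS tooling was certainly more involved. The core trick is just gdb attaching to a running process in batch mode and overwriting a global:

```python
#!/usr/bin/env python3
"""Minimal sketch of flipping an in-memory feature flag with gdb across hosts.

Assumptions (all hypothetical, not the actual AWS tooling):
  - the flag is a global int symbol named g_new_feature_enabled
  - the target service's process name is 'myservice'
  - passwordless ssh and sudo are available on every host in hosts.txt
"""
import subprocess

# gdb batch one-liner: attach to the pid, overwrite the flag, detach on exit.
GDB_CMD = (
    "sudo gdb -batch -p $(pgrep -o myservice) "
    "-ex 'set variable g_new_feature_enabled = 0'"
)

def flip_flag(host: str) -> bool:
    """Run the gdb one-liner on a single host; return True on success."""
    result = subprocess.run(
        ["ssh", host, GDB_CMD],
        capture_output=True, text=True, timeout=60,
    )
    return result.returncode == 0

if __name__ == "__main__":
    with open("hosts.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]
    failed = [h for h in hosts if not flip_flag(h)]
    print(f"flipped {len(hosts) - len(failed)} hosts, {len(failed)} failures")
```

Of course, for 120k servers you'd fan this out in parallel and verify the symbol exists before writing to it, which is exactly the point: attaching a debugger to live production processes is the sort of one-shot, no-undo operation where "follow the plan precisely" really does apply.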
> There is still a valuable lesson for the developer here though - double check everything, and don't screw up.
"Double check everything" is a good lesson, because we all can and should practice it.
"Don't screw up" is not useful advice because it's impossible. There's a reason we don't work like that... Who needs backups? Just don't screw anything up! Staging environment? Bah, just don't screw up deployments! Restricted root access? Nah, just never type the wrong command. No, we need systems that mitigate humans screwing up, because it will happen.
> the thought process of the company was probably not so much that he messed up the database that day, but that they'd never be able to trust him with actual production work down the line.
I think they simply acted emotionally, out of fear, anger, and stress. The vague legal threat and the way they otherwise ignored this dude both suggest it. The way events unfolded, it does not sound like much rational thinking was involved.