
Comment by champagnepapi

8 years ago

I agree, it's the fault of the CTO. To me, the CTO sounds pretty incompetent. The junior engineer did them a favor. This company sounds like an amateur-hour operation, given that a junior engineer could delete data that easily.

Yup, I've heard stories of junior engineers causing millions of dollars' worth of outages. In those cases the process was drilled into, the control that allowed the mistake was fixed, and the engineer was not reprimanded.

If you have an engineer who goes through that and shows real remorse, you're going to have someone who's never going to make that mistake (or similar ones) again.

  • Agreed. Several years ago, as a junior dev, I was tasked with adding a new feature: only allowing a user to have one active session.

    So, we added a "roadblock" post-auth with two actions: log out other sessions, or log out this session.

    Well, the db query for the first action (log out other sessions) was missing a WHERE clause... on user_id! (There's a rough sketch of the bug at the end of this comment.)

    Tickets started pouring in from users who had been logged out and didn't know why. Luckily, the on-call dev knew there had been a recent release, identified the missing WHERE clause, and added it within the hour.

    The feature made it through code review, so the team acknowledged that everyone was at fault. Instead of being reprimanded, we decided to revamp our code review process.

    I never made that kind of mistake again. To this day, I'm a little paranoid about update/delete queries.
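
    To make the failure mode concrete, here's a minimal sketch of the kind of query involved; the table, column, and parameter names are invented for illustration and aren't the actual schema:

        -- Buggy "log out other sessions": no user scoping, so it ends
        -- every other user's sessions too.
        DELETE FROM sessions
        WHERE session_id <> @current_session_id;

        -- Intended version: scope to the current user first.
        DELETE FROM sessions
        WHERE user_id = @current_user_id
          AND session_id <> @current_session_id;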

    • We all make this mistake eventually, often in far more spectacular fashion. My lessons learned are:

      1) Always have a USE statement (or equivalent);

      2) Always start UPDATE or DELETE queries by writing them as SELECT;

      3) Get in the habit of writing the WHERE clause first;

      4) If your SQL authoring environment supports the dangerous and seductive feature where you can select some text in the window and then run only that selected text — beware! and

      5) While developing a query to manipulate real data, consider topping the whole thing with BEGIN TRANSACTION (or equivalent), with both COMMIT and ROLLBACK at the end, both commented out (this is the one case where I use the run-selected-area feature: after evaluating results, select either the COMMIT or the ROLLBACK, and run-selected).

      Not all of these apply to queries that will live in application code, and I don't do all of these things all the time, but I try to take this stance whenever I'm writing meaningful SQL by hand (a rough sketch of the workflow is below).
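
      A rough, T-SQL-flavored sketch of points 1, 2, and 5 against a made-up orders table (every name here is invented for illustration):

          -- Point 1: be explicit about which database you're in.
          USE my_database;

          -- Point 5: wrap the whole thing in a transaction, with both
          -- endings left commented out until you've checked the results.
          BEGIN TRANSACTION;

          -- Point 2: the statement started life as this SELECT, to see
          -- exactly which rows would be touched before making it an UPDATE.
          -- SELECT * FROM orders WHERE status = 'pending' AND created_at < '2016-01-01';

          UPDATE orders
          SET status = 'expired'
          WHERE status = 'pending'
            AND created_at < '2016-01-01';

          -- After checking the affected row count, select ONE of these
          -- lines and run just the selection.
          -- COMMIT;
          -- ROLLBACK;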

      8 replies →

    • > Luckily the on-call dev knew there was a recent release and was able to identify the missing where clause and added it within the hour.

      Raises questions about the deployment process. Didn't the on-call have a previous build they could roll back to? Customers shouldn't have been left with an issue while someone tried to hunt down the bug (which "luckily" they located); the first step should have been a rollback to a known good build, with the bug tracked down before a re-release (e.g. by reviewing all changesets in the latest build).

      1 reply →

    • UPDATE cashouts SET amount=0.00 <Accidental ENTER>

      Oops. That was missing a 'WHERE user_id=X'. We did not lose the client at the time (this was 15+ years ago), but it was a rough month. I haven't made that mistake again, but it happens to all of us at some point (one guard that would have caught it is sketched below).
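
      For what it's worth, one guard that catches exactly this accidental-Enter case, assuming a MySQL session (other databases have their own equivalents; the cashouts table is from the story above, the rest is illustrative):

          -- MySQL safe-updates mode: UPDATE/DELETE statements without a
          -- key-based WHERE clause (or a LIMIT) are rejected with an error.
          SET sql_safe_updates = 1;

          -- The fat-fingered statement now fails instead of zeroing everyone out...
          UPDATE cashouts SET amount = 0.00;

          -- ...while the properly scoped one still runs (assuming user_id is indexed).
          UPDATE cashouts SET amount = 0.00 WHERE user_id = 42;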

      2 replies →

    • I'm guessing this feature was never tested properly.

      Code (or a feature) that isn't tested should be assumed to be broken.

  • At a former employer, we had a Megabuck Club scoreboard; your name, photo, and a quick outline of your (very expensive!) mistake got posted on it. Terrific idea, as:

    a) The culture was very forgiving of honest mistakes; they were seen as a learning opportunity.

    b) Posting a synopsis of your cockup made it easier for others to avoid the same mistake while we were busy making sure it would be impossible to repeat it in the future; also, it got us thinking of other, related failure modes.

    c) My oh my was it entertaining! Pretty much the very definition of edutainment, methinks.

    My only gripe with it was that I never made the honor roll...

    • We had something similar at one of my jobs; it's hard to relay in text, but it was really a fun thing. Mind you, this was at a Fortune 100 company, and the CIO would come down for the ceremony to hand over the award. We called it the Build Breakers Award, and we had a huge trophy made up at a local shop with a toilet bowl on it. If you broke the build and took down other developers, the passing-of-the-award ceremony was initiated: I would ping the CIO (it was my dev shop), he would come down and give a whole sarcastic speech about how the wasted money was all worth it because the developers got some screw-off time while the breaker fixed the build. It was all in good spirit, though, and people could not wait to get that trophy off their desk; it helped that the thing was probably as big as the Stanley Cup.

      1 reply →

Yep. A few years ago I had a junior working for me who made a rather unfortunate error in production and deleted several customers' data. I could tell he was on pins and needles when he brought it to me, so I let him off the hook right away and showed him the procedures to fix the issue. He said something about being thankful there was a way to fix the problem, and I just smiled and told him A) it would have been my fault if there hadn't been; and B) he wouldn't have had the access he did without safeguards in place. Then I told him a story about the time I managed to accidentally delete an entire database of quarantined email from a spam appliance I was working on several years earlier. Sadly, my CTO at the time did NOT prepare for that.

I lost a whole weekend of sleep recovering that one from logs, and that was when I learned some good tricks for ensuring recoverability....

Agreed. Also, why didn't they have a backup of some sort? The hard drive on the server could have failed and it would have been just as bad.

Sounds like an incompetent set of people running the production server.

  • Like a lot of companies, I bet they HAVE backups; they've just never tested whether the backup process actually works. It's absurdly common...

    • This is trivial tho'. Just set up a regular refresh of the dev env via the backup system (something like the sketch below). Sure, it takes longer because you have to read the tapes back, but it's worth it for the peace of mind, and it means that every dev knows how to get at the tapes if they need to.
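
      A minimal sketch of what that refresh job might run, assuming a SQL Server shop restoring from a disk copy of the backup; the paths, logical file names, and table are all invented:

          -- Restore last night's production backup into a throwaway dev database.
          RESTORE DATABASE dev_restore_check
          FROM DISK = N'/backups/prod_full.bak'
          WITH REPLACE,
               MOVE N'prod_data' TO N'/data/dev_restore_check.mdf',
               MOVE N'prod_log'  TO N'/data/dev_restore_check_log.ldf';

          -- A cheap sanity check: if this returns a plausible number,
          -- the backup was at least restorable and readable.
          SELECT COUNT(*) AS user_rows FROM dev_restore_check.dbo.users;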

      1 reply →

    • In my experience, if I don't test my backups they stop working within about a year. So if I haven't tested backups in over a year, my assumption is that I probably don't have working backups.

    • Most likely something like this. There's probably backup software running, but it's either producing nothing but failed jobs or it's misconfigured, so the backups aren't actually usable.

      3 replies →

  • OP says the team is 40+ and the CTO just let them all walk on a catwalk.