
Comment by atonse

4 years ago

Aww I’m feeling bad for the poor engineer who is saying “crap” a thousand times now, or is blissfully unaware that their day is about to get much worse.

Once during a system upgrade we ran some scripts and they triggered emails to about 8,000 people before we realized it (would’ve been 150k people otherwise). The next day was all about clean up, sending an apology email etc.

My mother had visited the next day and asked why I wasn’t hanging out, and my son (6 at the time) said, totally unfazed, “oh, he can’t come now because he’s saying sorry to 8,000 people”

Hope you get through it! These mistakes happen. Oh, and your test passed. :-)

That engineer should absolutely point out that despite the mistake, he just created the most engaging newsletter email ever in defiance of modern marketing practices. The real cosmic brain strategy is to run with this somehow and turn it into something like those guerrilla marketing campaigns.

  • That’s correct.

    It happened to me as well: we sent out a newsletter for a B2B e-commerce site with all links pointing to staging, which was behind .htaccess, of course. So no tracking pixels, no images, etc.

    Some complaints, lots of „hey, you did something wrong“, even more „I can’t open it, please send it again“, and it was the best week ever in terms of sales.

Similar story... about 10 years ago, I had written a really simple script to email all our customers. It worked great for a long time, but then suddenly we went over 1000 customers.

My script was supposed to grab customers in batches of 1000 and keep looping until it ran out of customers (signaled by a request for the next batch returning fewer than 1000 customers).

My script was missing the offset part of the query, so after we hit 1000 users, it just kept looping, sending the same email over and over to our first 1000 users.

I felt so bad that day. From then on, sending out emails was this whole huge process that involved queuing them all and then having like 6 people review to make sure we didn't mess it up.
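For anyone who hasn't hit this one yet, here is a minimal Python sketch of the loop described above with the offset in place. `fetch_batch(limit, offset)` is a hypothetical stand-in for whatever query you actually run (e.g. `... ORDER BY id LIMIT ? OFFSET ?`), and the sketch assumes a stable sort order; without one, rows can shift between pages.

```python
from typing import Callable, Iterator, List

BATCH_SIZE = 1000


def iter_customers(
    fetch_batch: Callable[[int, int], List[str]],
    batch_size: int = BATCH_SIZE,
) -> Iterator[str]:
    """Yield every customer exactly once by advancing the offset each round."""
    offset = 0
    while True:
        batch = fetch_batch(batch_size, offset)
        yield from batch
        if len(batch) < batch_size:
            # A short (or empty) batch means we've reached the end.
            return
        # The step the original script was missing: skip the rows already
        # seen. Without it, every call returns the same first batch forever.
        offset += batch_size


# Usage with an in-memory list standing in for the customers table.
customers = [f"user{i}@example.com" for i in range(2500)]


def fake_fetch(limit: int, offset: int) -> List[str]:
    return customers[offset:offset + limit]


sent = list(iter_customers(fake_fetch))
print(len(sent), len(set(sent)))  # 2500 2500
```

Without the `offset += batch_size` line, every full batch looks like "there might be more", so the loop never terminates, which is exactly the over-and-over behavior described. Keyset pagination (`WHERE id > last_seen_id`) avoids offsets altogether and is another common option.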

  • When I worked on email systems, my worst nightmare was a Sorcerer's Apprentice sort of problem.

  • Oh man, I just wrote a piece of code that lets me write any Markdown, push a button, and it sends it out to our ~1,000-2,000 users.

    An hour later I commented it all out and wrote a note to myself: "if you need this, uncomment and push back to stage". Just having that code sitting around makes me nervous.

  • We introduced an allowlist on all our testing and staging environments to ensure that only certain recipients can get email. We also make sure that no email address in these databases would work, unless we really want to send to it.

  • I was always super careful with that stuff. The final send function wasn't unlocked until I knew for certain that everything worked as intended.
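A rough sketch of the allowlist idea, combined with a send function that stays in dry-run mode unless explicitly unlocked. The environment names, `ALLOWED_RECIPIENTS`, `ALLOWED_DOMAINS`, and the `send_one` callback are all hypothetical placeholders, not anyone's actual setup.

```python
from typing import Callable, Iterable, List

# Hypothetical allowlist for non-production environments.
ALLOWED_RECIPIENTS = {"qa-team@example.com"}
ALLOWED_DOMAINS = {"test.example.com"}


def filter_recipients(recipients: Iterable[str], environment: str) -> List[str]:
    """Outside production, drop every address that isn't explicitly allowed."""
    recipients = list(recipients)
    if environment == "production":
        return recipients
    allowed = []
    for address in recipients:
        domain = address.rsplit("@", 1)[-1].lower()
        if address.lower() in ALLOWED_RECIPIENTS or domain in ALLOWED_DOMAINS:
            allowed.append(address)
    return allowed


def send_campaign(
    recipients: Iterable[str],
    environment: str,
    send_one: Callable[[str], None],
    dry_run: bool = True,
) -> List[str]:
    """Dry run by default: report who *would* get the mail and send nothing."""
    targets = filter_recipients(recipients, environment)
    if not dry_run:
        for address in targets:
            send_one(address)
    return targets


outbox: List[str] = []
everyone = ["qa-team@example.com", "customer@example.org", "bot@test.example.com"]
print(send_campaign(everyone, "staging", outbox.append))
# ['qa-team@example.com', 'bot@test.example.com']
print(outbox)  # [] because nothing is sent in dry-run mode
```

The point of defaulting `dry_run` to `True` is that the dangerous path requires an explicit, reviewable change at the call site rather than the absence of one.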

I sympathize, because once, when I was a junior engineer, I accidentally emptied the test email spool, sending the entire company, including upper management, all sorts of fake test emails about hires, fires, and whatnot that had accumulated over the past years. What is more, a coworker convinced an HR person to play a prank and call me to HR office next day for giggles. Needless to say, it was stressful and infuriating at the same time.

Now I know better, and if they somehow read this, I would advise them to just chill and not take any more shit than necessary. If you were not one of the few people with root access but somehow still had the ability to mass-email in prod, that is not your problem; it is an organizational problem. For an operation the size of HBO's, anything prod has to be behind sufficient failsafes and a peer-reviewed process (except maybe for a very rare "break glass" emergency).

I hope there will be a good, rational postmortem that coolheadedly identifies the root causes and creates action items for the actual stakeholders. If your shop is worth its salt, there won't be performance-evaluation consequences for you. If there are, no worries either: it is time to look for a better place.

  • > What is more, a coworker convinced an HR person to play a prank and call me to HR office next day for giggles.

    That's the type of thing HR people should be putting a stop to, not literally being a party to. I don't have any illusions about HR being there for the employee rather than the employer, but I can't imagine working for a place where HR is abusing their authority to add stress and shame solely for their own amusement.

> Aww I’m feeling bad for the poor engineer who is saying “crap” a thousand times now, or is blissfully unaware that their day is about to get much worse.

I hope not. It sounds like the test database was not being anonymized, but sometimes things like this can be as simple as not selecting a debug build in Visual Studio. Either of those is an organizational issue, not an individual one, so he shouldn't be punished.

This is especially true if you have corporate buzzwords like "taking ownership and responsibility". No one will take responsibility if there are punishments for owning up to and admitting mistakes. Odds are they feel pretty bad about it already.

While we're sharing personal anecdotes: parents get very upset when they incorrectly receive truancy reports because you forgot to check the IsDeceased flag...

  • > I hope not. It sounds like the test database was not being anonymized

    Taking a copy of a production database and using it for tests is a bad idea, even if you believe you're expunging any private user data.

    Development, staging, and test environments just shouldn't ever have access to production data. If you're at a company that's ISO27001 certified for data security it even goes as far as most employees not having any access to data. I've never seen any production data for the app I work on.

    https://en.m.wikipedia.org/wiki/ISO/IEC_27001

    • I agree about the part of not accessing information from production.

      But I wonder how we could debug or test something that happens only in production. I ask because some bugs only appear at the intersection of code and data.

      So far my strategy is to do the following:

      1. Only one person can access the production DB. This person makes a backup copy and stores it, encrypted, in internal storage.

      2. Another person takes the backup and runs an anonymizer script on the data. What the anonymizer should do beyond the obvious cleaning of personal data from user accounts is still up for debate. One important (and hard) step is regenerating the UUIDs while keeping foreign-key integrity.

      At the end, this person creates a new internal DB with the anonymized data.

      3. Someone reviews the new DB and marks it as ready to use.

      Then a dev can ask access to this fresh copy.

      In some teams I experimented with making this process fully automated up to the review step. But then, if the anonymizer has bugs, we suddenly have a live internal DB with customer data, which is exactly what we don't want.

      As an alternative, for small projects only, I once wrote a script that analyzes the DB data and tries to create a similar data structure from scratch, but with fake data.


    • > I've never seen any production data for the app I work on.

      I agree with the rest, at least in a perfect world, but not being allowed to look at production data? In the jobs I've had recently I wouldn't even be able to hypothesize what the problem is without looking at production data and production logs. Some of the issues wouldn't even have been reported if I hadn't been checking the logs.

      How do you bridge the gap from problem to replication and/or something actionable? Do you have someone knowledgeable enough in a role where they can feed you this information?


    • I'm also in a similar situation, but in my case I can't even get access to the application logs unless I explicitly ask for them (typically as part of solving a specific problem, given a time of occurrence); same for APM data.

      While there's certainly something good to be said about data security in such instances, it makes catching and fixing errors absolute hell, especially if the clients are unaware of the occasional exceptions appearing in the logs, or they send the wrong logs (in the case of old-fashioned file-based logging with unclear logging strategies).

      A daily ETL with data anonymization/pseudonymization from prod into the test environments would be really good to have, yet I haven't really seen any companies adopt that. The closest I've seen were situations where the production data would be manually exported, scripts run against it, and the result handed to the developers quarterly at best.

      That concludes my tiny rant that's only vaguely related to the topic (DB data vs. log data), though it could also encourage discussion about which data is available to other developers and how they approach it (e.g. trying never to log things like monetary amounts or personal data so the logs stay harmless, and the tradeoffs of that, like the logs becoming less useful). Heck, maybe someone out there has automated the things I mentioned above.
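On the anonymizer step mentioned above (regenerating UUIDs while keeping foreign-key integrity) and the daily pseudonymization ETL: one approach, sketched here as an assumption rather than anything the commenters described, is to derive each new ID from the old one with a keyed HMAC-SHA256. The same old ID always maps to the same new ID, so foreign keys line up across tables and across runs. The `users`/`orders` tables and their columns are hypothetical.

```python
import hashlib
import hmac
import uuid
from typing import Dict, List, Tuple


def pseudonymize_id(old_id: str, key: bytes) -> str:
    """Map a real UUID to a stable fake one (same input and key, same output)."""
    digest = hmac.new(key, uuid.UUID(old_id).bytes, hashlib.sha256).digest()
    # version=4 sets the version/variant bits so the result is a valid UUID.
    return str(uuid.UUID(bytes=digest[:16], version=4))


def anonymize(
    users: List[Dict], orders: List[Dict], key: bytes
) -> Tuple[List[Dict], List[Dict]]:
    new_users = []
    for u in users:
        new_id = pseudonymize_id(u["id"], key)
        new_users.append({
            "id": new_id,
            # Personal data is replaced outright, not hashed.
            "email": f"user-{new_id[:8]}@example.invalid",
            "name": "Test User",
        })
    new_orders = [
        {
            "id": pseudonymize_id(o["id"], key),
            # Same function applied to the foreign key keeps the join intact.
            "user_id": pseudonymize_id(o["user_id"], key),
            "total": o["total"],
        }
        for o in orders
    ]
    return new_users, new_orders


key = b"keep-this-out-of-the-test-environment"
users = [{"id": "12345678-1234-5678-1234-567812345678",
          "email": "jane@example.org", "name": "Jane Doe"}]
orders = [{"id": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
           "user_id": "12345678-1234-5678-1234-567812345678", "total": 42}]
fake_users, fake_orders = anonymize(users, orders, key)
print(fake_orders[0]["user_id"] == fake_users[0]["id"])  # True
```

The key has to live outside the test environment, since anyone holding it and a list of real IDs can rebuild the mapping. Also keep in mind that under the GDPR, pseudonymized data still counts as personal data; the remaining columns can be enough to re-identify people.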

Honestly I'd rather them not send an apology email... because then I'd have two useless emails. And in the grand scheme of things, it's literally just one more email I receive during the day.

  • I for one am happy to know that they're running integration tests at all, so I don't think they have anything to apologize for. It was just an email -- I get many uninvited emails per day. This uninvited email happened to be from an engineer rather than a marketer.

    • The first thought I had, though, was if some team decided to pull down their prod DB for testing so they had 'realistic data'.

      Because that would be a much bigger problem than sending an email by accident.

  • I think in this case the first email is too cryptic to leave without a follow-up; otherwise it would be unsettling to people.

    • I got it. It was one line and included the word "test", so I just ignored it. It wasn’t scary, but it wasn’t clear what it was.

  • However - most of HBO's customers are non-technical and will be wondering what kind of test this is and if they should be worried.

  • Well, I got my integration test e-mail yesterday but still no apology e-mail, so I think they listened to you.

  • It depends on what the email was about. If its content reads like "test email" or some nonsense like that, never mind. But if it looks like an email that would have significant consequences for the recipient were it legit, it should definitely be clarified what's up. Also, a well-crafted email will hopefully prevent people from invoking the GDPR against the sender.

  • The first email was useful to someone! An apology email is useless to everyone.

    • I disagree, it's definitely useful. It's useless only if it feels/is dishonest. One way to avoid sounding dishonest is letting the dev team write the small apology piece.

> Once during a system upgrade we ran some scripts and they triggered emails to about 8,000 people before we realized it (would’ve been 150k people otherwise). The next day was all about clean up, sending an apology email etc.

On the bright side, if you can accidentally send an automated email to that many people, then sending another email to them to apologise is unlikely to be a manual effort either.

  • When you've interrupted the process partway through as they did, figuring out which part of the list you need to apologize to may well be significantly more effort.
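One way to make "who actually got it?" a query instead of an investigation is to log every recipient per campaign before sending. A minimal sketch using Python's built-in `sqlite3`; the `send_log` table, the campaign label, and the `send_one` callback are illustrative assumptions. The primary key also means a runaway loop, like the missing-offset script described earlier in the thread, can't send the same campaign to the same address twice.

```python
import sqlite3
from typing import Callable, Iterable, List


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS send_log ("
        " campaign TEXT NOT NULL,"
        " recipient TEXT NOT NULL,"
        " PRIMARY KEY (campaign, recipient))"
    )
    return conn


def send_once(
    conn: sqlite3.Connection,
    campaign: str,
    recipients: Iterable[str],
    send_one: Callable[[str], None],
) -> int:
    """Send each recipient the campaign at most once; return how many were sent."""
    sent = 0
    for address in recipients:
        cur = conn.execute(
            "INSERT OR IGNORE INTO send_log (campaign, recipient) VALUES (?, ?)",
            (campaign, address),
        )
        conn.commit()
        if cur.rowcount == 0:
            continue  # already logged for this campaign: skip, never resend
        send_one(address)
        sent += 1
    return sent


def recipients_of(conn: sqlite3.Connection, campaign: str) -> List[str]:
    """Everyone logged for a campaign, i.e. the apology list."""
    rows = conn.execute(
        "SELECT recipient FROM send_log WHERE campaign = ? ORDER BY rowid",
        (campaign,),
    )
    return [r[0] for r in rows]


conn = open_log()
outbox: List[str] = []
print(send_once(conn, "integ-test-1", ["a@example.com", "b@example.com"], outbox.append))  # 2
print(send_once(conn, "integ-test-1", ["a@example.com", "b@example.com"], outbox.append))  # 0
print(recipients_of(conn, "integ-test-1"))  # ['a@example.com', 'b@example.com']
```

This logs before sending, so it errs toward at-most-once delivery: if `send_one` crashes, that one address is recorded but may not have received the mail. For an accidental blast, that's the safer failure mode.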

The bigger and more concerning issue would be recipients complaining about the incoming email and marking it as spam, which could damage the reputation of HBO Max's domain and potentially send future legitimate messages to spam. (provided the integration tests don't use a non-primary domain for sending)

I learned this the hard way in my last role. I worked for a small company that wanted to do custom email marketing. I was pretty gung-ho about it; I thought I'd just set up a script to loop through contacts and use Mailgun to send the email from a custom domain. As we used the tool, we saw a steady drop-off in click-throughs to the site: the majority of our messages were getting caught in spam filters.

Turns out there's a whole science to email marketing: how emails should be structured, formatted, etc. Often the spam criteria include how often the domain was flagged for spam in the past, the length of the email, its contents, etc.

Ended up relying on this tool quite a bit: https://www.mailgun.com/deliverability/email-spam-checker/

I had an experience like this once. Luckily it was less visible, but I felt like a fool all the same.

I came out as trans and changed my name last year, and with the name change I set up a new email alias for work. Then I set up automation to send out a gentle reminder email about the change for people who emailed my old alias. It worked fantastically for a few months… right up until the point that (due to a series of individually innocent events) the automation ended up running across the entirety of my 7 years worth of inbox. Everyone who had mailed me over the 7 years prior to the name change started getting the reminder email. One reminder for each email they had sent me. The worst part is that due to a bug in the email automation stuff, emails sent by the automation weren’t preserved in my sent box. So I don’t even know how many people I spammed. If I had to guess, I sent dozens of emails to the CEO and other execs, hundreds to my director, and thousands to people who worked closely with me over the years.

I learned a valuable lesson that day.
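A sketch of two guards that would limit this kind of runaway: only consider mail received after the alias change, and remind each sender once rather than once per message. `Message`, the cutoff date, and the in-memory `already_reminded` set are hypothetical; a real version would persist that set between runs.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple, Set


class Message(NamedTuple):
    sender: str
    to_alias: str        # the address the message was sent to
    received: datetime   # timezone-aware timestamp


def reminder_targets(
    messages: Iterable[Message],
    old_alias: str,
    cutoff: datetime,
    already_reminded: Set[str],
) -> List[str]:
    """Return senders who should get one name-change reminder each."""
    targets: List[str] = []
    for m in messages:
        if m.to_alias.lower() != old_alias.lower():
            continue
        if m.received < cutoff:
            # Mail from before the change is never reprocessed, however the
            # automation ends up scanning the inbox.
            continue
        sender = m.sender.lower()
        if sender in already_reminded:
            # One reminder per person, not one per message.
            continue
        already_reminded.add(sender)
        targets.append(sender)
    return targets


cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)
inbox = [
    Message("ceo@corp.example", "old@corp.example", datetime(2015, 3, 1, tzinfo=timezone.utc)),
    Message("Boss@corp.example", "old@corp.example", datetime(2021, 2, 1, tzinfo=timezone.utc)),
    Message("boss@corp.example", "old@corp.example", datetime(2021, 2, 2, tzinfo=timezone.utc)),
]
reminded: Set[str] = set()
print(reminder_targets(inbox, "old@corp.example", cutoff, reminded))  # ['boss@corp.example']
```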

  • I've been laughing at myself for like two years now about how awkwardly I came out, and only now do I realize how fortunate I am that I didn't try to automate it. Thank you very much.

  • I thought I was done with queer tragedy stories, but I could read more of this subgenre.

  • Do you really share these things in a professional context? What country was this in?

    • You're getting downvoted to hell, but I'm going to respond anyway. The parent poster here is fully transitioning their gender. They're not cross-dressing at night, they're not "closeted trans"--they're changing their identity for their whole life, in all contexts--personal, professional, etc. Furthermore, they don't want to be known by their old identity anymore--in the parlance, that's called a deadname. So they're informing people of that happening, of who they're identifying as from now on.


    • It’s often considered professional courtesy to let people know when a property that matters to 99% of humanity (name and gender) changes permanently. Everyone takes a different approach. There are upsides and downsides to the “email autoresponder” method, but it’s certainly an acceptable option in local instances of context.


    • Is there any professional context where you wouldn't share when your professional email address changed?

An HR person was poking around the admin panels of a payroll system and accidentally clicked a "notify on SSN change" box. An email with the before and after Social Security numbers went out to 20,000 people.

I had a bug in our invoice sending script. Someone had been getting thousands of copies of their invoice. They called and politely asked if we could stop sending them that mail :D

Badge of honor to be honest. Even if HBO Max fired me I would forever be stoked to be "that guy/gal" who sent out an integ test message to the entire userbase.

  • If HBO fired them for this then they'd have to bring in someone without the "Don't accidentally send a million emails" training to replace them

    That intern isn't going to be sending a mass email again without a double check.

Can relate. Had an excitable junior engineer working on migrating between platforms, spammed the global list with '123456'

  • Without knowing any details: the fact that you frame it this way (their fault for being junior and "excitable") and don't mention any failed safeguards or process failures (implying that their mistake was an easy one to make) doesn't show you (or that org) in a great light. This sort of thing is a huge org smell for me; people will always make mistakes, and how you deal with that fact tells a lot about the quality of leadership.

    And what exactly is the "excitable" bit supposed to tell us?

    • It's probably supposed to tell you this is a humorous anecdote about a mistake someone made and not a thinkpiece about all of the valuable lessons this company learned.

Who thought sending another spam mail as an apology for sending spam mail was a good idea?