Comment by fleddr
4 years ago
Amateur mistake, compared to my professional mistakes.
I once wrote a bot that sent email within the company (100K+ employees). I kick-started the bot on a server remotely and only then discovered it's an endless loop. It required server admin rights to stop it, which I did not have.
I couldn't immediately reach server admins so had to physically drive there. An hour or so later somebody helped me kill the process.
The emails already sent could not be cleared out server-side, which meant that recipients' email clients kept freezing for over a week, unable to handle the volume of typically 300K new emails per recipient. They had to Ctrl+A and delete 100 emails at a time, then the next 100, and so on, all while being careful not to delete real and useful emails.
I pretty much destroyed email for those people.
I don't just destroy things at scale though, also at home. Around the time we had our first home broadband internet connection, I set up a web server and just kept my PC running. Unknown to me, the web server software included an email server with open relay enabled by default.
About 3 days later, my dad complained that the internet connection wasn't working. The ISP had detected the issue (millions of emails sent out via my home server) and gave us a red card, fully shutting us down, permanently.
> I kick-started the bot on a server remotely and only then discovered it's an endless loop. It required server admin rights to stop it, which I did not have.
That sinking feeling when you realize you've started something bad and can't stop it always gives me a visceral feeling like the world is doing a dolly zoom* around me.
I have had two really fun bulk email screwups:
The first one started when we hit the send button for a mass email campaign driving traffic to our newly launched website redesign. We immediately realized that the email marketing software had put a unique query parameter in every link, so every request missed our cache and went straight to origin, which instantly smoked the little VM hosting the site and sent thousands of clicks to the now-famous 503 Guru Meditation page. With the infrastructure folks offline in another timezone, it was the perfect environment to learn Varnish Configuration Language on the fly with the whole marketing team hanging over me with looks of horror on their faces!
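The general idea of the fix was to make the cache ignore that parameter. The real thing was written in VCL, but here's a rough Python sketch of the same idea, normalizing away tracking parameters before the cache lookup; the parameter names below are made up:

    from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

    # Parameter names are assumptions; the actual campaign used its own
    # per-recipient tracking parameter.
    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "mc_eid"}

    def cache_key(url: str) -> str:
        """Drop tracking parameters so every click maps to one cached object."""
        parts = urlparse(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
        return urlunparse(parts._replace(query=urlencode(kept)))

    print(cache_key("https://example.com/launch?page=2&mc_eid=abc123"))
    # https://example.com/launch?page=2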
The second one involved coming to work, sitting down with my coffee and noticing that our email sending process had crashed overnight. Given the rate they could be sent sequentially I realized we'd have a big backlog of tasks, so I wrote a quick shell script to split the tasks into separate lists and parallelize them across the number of cpus on our (big, colocated, dedicated, hosting many important apps and websites) server. As soon as I ran it I sent it to the background and opened up `top`, only to see thousands and thousands of forks of my process filling the list. By the time I realized that I'd flipped the numbers and split 4 email tasks into each of 50,000 processes instead of 50,000 tasks into 4 processes, the server locked up and my SSH session disconnected. Cue several panicked minutes of our apps being offline while I scrambled for the restart button in the remote management console. Somehow there were no lasting effects, and every service started up on its own when the box came back online, even though it hadn't been restarted in several years.
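For the curious, the intended split looks roughly like this in Python (not the original shell script, and all names here are made up). The bug was swapping the two numbers, so the batch count became the size of the backlog:

    import os
    from multiprocessing import Pool

    def send_email(task):
        # Stand-in for the real sending logic.
        print(f"sending {task}")

    def send_batch(batch):
        for task in batch:
            send_email(task)

    def chunk(tasks, n_chunks):
        """Split tasks into at most n_chunks roughly equal batches."""
        size = -(-len(tasks) // n_chunks)  # ceiling division
        return [tasks[i:i + size] for i in range(0, len(tasks), size)]

    if __name__ == "__main__":
        backlog = [f"task-{i}" for i in range(50_000)]
        n_workers = os.cpu_count() or 4

        # Right way round: 50,000 tasks split into one batch per CPU.
        # Swapping the numbers here is what launches tens of thousands
        # of processes instead of a handful.
        batches = chunk(backlog, n_workers)

        with Pool(processes=n_workers) as pool:
            pool.map(send_batch, batches)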
* https://filmschoolrejects.com/wp-content/uploads/2021/01/Jaw...
The proposed name for that moment of realization and horror is the onosecond:
https://youtu.be/X6NJkWbM1xk
Beautiful screw-ups, thanks for sharing.
On a serious note, I find it concerning how development and admin work now blur into a single role. It may now be quite common for a front-end developer to also make all kinds of potentially disastrous admin/infra changes.
I think this is particularly true with all the cloud stuff. 20 years ago, adding new servers to our DC would be a 6-month process. Now I can accidentally spin up 500 in 1 second.
> On a serious note, I find it concerning how development and admin work now blur into a single role.
It is. I know enough about admin work to know that I don't know the endless number of small but, for production, very important details: the kind you can get away without on a local dev setup and that don't tell you when they're wrong. Things just won't work well, you'll have no idea why, and you might even think it's a problem with your software rather than your setup.
Doing admin properly is as big of a job as any programming task and it's a very different field of expertise.
And Docker doesn't fix it at all. At best, it improves the illusion that you're doing admin work correctly.
> That sinking feeling when you realize you've started something bad and can't stop it always gives me a visceral feeling like the world is doing a dolly zoom* around me.
The largest unit of time known to mankind, the "Ohnosecond".
Thanks for sharing this. I hope that the person who made the mistake (or better yet, their manager) sees these types of stories and realizes how common this sort of thing is, even for good engineers.
I wouldn't go so far as to say that it's never an engineer's fault, but this sort of thing usually relates more to faulty processes than people.
I'm reminded of the engineer at AWS who was writing a bash script about 5 years ago and unwittingly took down a good chunk of AWS's main east coast region, causing outages for tens of thousands of websites. I remember admiring that their response wasn't to fire the engineer, but rather to say that any system that large that allows a single non-malicious engineer to take it down must need some beefing up to make that sort of mistake impossible.
Thank you for the appreciation, it was in part my goal to normalize mistakes and to treat them in a light manner.
I particularly object to kicking a man when he's already down, which is common behavior these days.
I think spreading awareness of the error, pointing out its stupidity, or treating it as entertainment is really cruel. I'm sure the person involved knows they screwed up and is embarrassed, so all these piled-up messages only hurt.
And I still think it was incredibly harmless. Any individual would get at most one useless email, which contained nothing inappropriate.
As I said, this intern still needs to go pro with their errors; this is not a pro-level screwup.
To avoid accidentally deleting useful emails while “Ctrl-A-ing”, could they have run a search by sender or some other criteria first? Then only delete the search results.
In the email client? Probably yes, if they know how. This was a very long time ago, early 2000s, and the email client was the dreaded Lotus Notes software.
I had to use that client recently. I can't believe it doesn't even have a proper email search function. Best you can get is ctrl-f on the open page or sorting all your email alphabetically.
The company didn't have sysadmins for the email server that could have cleaned up the mess?
> I pretty much destroyed email for those people.
GUI clients might be destroyed, but wasn't writing an IMAP script to bulk delete all emails an option?
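Something along these lines is what I had in mind, assuming the server spoke IMAP at all (server, credentials and the bot's address below are placeholders):

    import imaplib

    HOST = "imap.example.com"
    USER = "victim@example.com"
    PASSWORD = "app-password"
    BOT_SENDER = "runaway-bot@example.com"

    with imaplib.IMAP4_SSL(HOST) as imap:
        imap.login(USER, PASSWORD)
        imap.select("INBOX")

        # Search only for the bot's messages so real mail is left alone.
        status, data = imap.search(None, "FROM", f'"{BOT_SENDER}"')
        if status == "OK" and data[0]:
            ids = data[0].split()
            # Flag in batches; a single 300K-ID command can hit server limits.
            for i in range(0, len(ids), 1000):
                batch = b",".join(ids[i:i + 1000]).decode()
                imap.store(batch, "+FLAGS", "\\Deleted")
            imap.expunge()  # permanently remove the flagged messages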
> email server with open relay enabled by default.
Holy cow, every self-hoster's nightmare.
The only fun part of the experience was checking the email queue, once I discovered it.
To my amazement, it was simply a dictionary-attack style of approach:
a@yahoo.com
aa@yahoo.com
ab@yahoo.com
And so on. This was 20 years ago so I'm sure it's more sophisticated now, sending based on web scrapers, leaked databases, etc.