Comment by misir
17 hours ago
From what I've seen, if you're depending on AWS, if something fails you too need someone 24x7 so that you can take action as well. Sometimes magic happens and systems recover after aws restarts their DNS, but usually the combination of event causes the application to get into an unrecoverable state that you need manual action. It doesn't always happen but you need someone to be there if it ever happens. Or bare minimum you need to evaluate if the underlying issue is really caused by AWS or something else has to be done on top of waiting for them to fix.
How many problems is AWS able to handle for you that you are never aware of though?
Distributed systems can partly fail in many subtly different ways, and you almost never notice it because there are people on-call taking care of them.
How many problems do you think there are?
I've only had one outage I could attribute to running on-prem, meanwhile it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down with news reports etc our own services are all running fine.