I once worked at a company that had a wealth of backups: a backup generator, backup batteries (since the generator takes a few seconds to start), a contract for emergency fuel deliveries, a complete failover data centre full of hot standby hardware, 24/7 ops presence, UPSes on the ops PCs just in case, weekly checks that the generators start, quarterly tests that cut the breakers to the data centre, and so on.
It wasn't until a real incident that we learned: (a) the system wasn't resilient to utility power flapping on-off-on-off-on-off, because each 'off' drained the batteries while the generator started, and each 'on' made the generator shut down again; (b) the ops PCs were on UPSes but their monitors weren't (C13 vs C5 power connector); and (c) the generator couldn't be refuelled while running.
Even if you've got backup systems and you test them, you can never be 100% sure.
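To make failure mode (a) concrete, here is a rough sketch of how flapping utility power can drain a battery bank even when the generator works perfectly. Everything in it is an assumption for illustration: the function name, the parameters, and the simple linear drain/recharge model are hypothetical, not how the actual site behaved.

```python
def cycles_until_dark(battery_s, gen_start_s, off_s, on_s,
                      recharge_per_s, max_cycles=1000):
    """Return the flap cycle on which the battery runs out, or None.

    Hypothetical model: the battery holds `battery_s` seconds of runtime.
    On each utility 'off', it carries the load until the generator is up
    (`gen_start_s`) or the utility returns (`off_s`), whichever is first.
    On each 'on', the generator shuts down and the battery regains
    `recharge_per_s` seconds of runtime per second of utility power.
    """
    charge = battery_s
    for cycle in range(1, max_cycles + 1):
        # Utility drops: the battery bridges the gap.
        charge -= min(off_s, gen_start_s)
        if charge < 0:
            return cycle  # load dropped during this outage
        # Utility returns: generator stops, battery recharges slowly.
        charge = min(battery_s, charge + on_s * recharge_per_s)
    return None


# A battery that comfortably covers one generator start still fails
# if each flap drains more than the 'on' period puts back.
print(cycles_until_dark(battery_s=60, gen_start_s=10,
                        off_s=30, on_s=30, recharge_per_s=0.1))
```

A single breaker test never exercises this path, because it is one clean 'off' followed by one clean 'on'.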
The point of being "cloud native" is you build redundancy at higher levels. Instead of having extra pipes and wires, you have extra software that handles physical failures.
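As a minimal sketch of what "extra software that handles physical failures" can look like, the snippet below tries a list of replicas in order and moves on when one is unreachable. The name `call_with_failover` and the idea of modelling each replica as a plain callable are assumptions made for illustration, not any particular cloud provider's API.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Iterable[Callable[[], T]]) -> T:
    """Return the first successful result, trying replicas in order.

    Each replica is a hypothetical zero-argument callable that raises
    ConnectionError when its underlying hardware or site is unavailable.
    """
    last_error = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as err:
            # This replica is down; remember why and try the next one.
            last_error = err
    raise RuntimeError("all replicas failed") from last_error


def dead_site():
    raise ConnectionError("primary data centre has no power")


def healthy_site():
    return "served from standby region"


print(call_with_failover([dead_site, healthy_site]))
```

The caveat from the story above still applies: this failover path is itself a backup, and it only counts once it has been exercised for real.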
A plan that has never been executed is really just hope and wishful thinking.
What happens when the backup breaks?
At a certain point, Earth is a single point of failure.
You have a backup for the backup's backup.
Turtles all the way down.
At AWS scale, even unlikely hardware events become common, I guess.
Each turtle gives them another 9. How many 9s are they down due to incidents over the past year?
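The "another 9 per turtle" arithmetic can be sketched as below. It only holds under two assumptions that are mine, not the thread's: each layer is about 90% available on its own, and layers fail independently. The function names are hypothetical.

```python
import math


def combined_availability(availabilities):
    """Availability of redundant layers, assuming independent failures.

    The system is down only when every layer is down at once, so the
    individual unavailabilities multiply.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def nines(availability):
    """Number of nines, e.g. roughly 3 for 99.9% availability."""
    return -math.log10(1.0 - availability)


# Stack up to four layers that are each 90% available on their own.
for layers in range(1, 5):
    total = combined_availability([0.9] * layers)
    print(layers, "layer(s):", round(nines(total), 2), "nines")
```

Real failures are often correlated (the flapping power in the story above hit the batteries and the generator together), so stacked backups usually deliver fewer nines than this idealised sum suggests.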
They absolutely have backups, I presume they were ineffective or also down for _reasons_.