Comment by pulkitsh1234
2 hours ago
Genuinely curious: how would you actually implement detection systems for large-scale global infra that work within a < 1 minute SLO, given that cost is no constraint?
Right now I'd say maybe don't push changes to your entire global infra all at once, and certainly not without testing your change first to make sure it doesn't break anything. But it's really not about a specific failure/fix as much as it is about a single company getting too big to do the job well, or just plain doing more than it should in the first place.
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.