Comment by spwa4

3 days ago

Surely if you build a status detector you realize that colo or dedicated are your only options, no? Obviously you cannot host such a service in the cloud.

1 comment

spwa4

mandevil 3 days ago

I'm not even talking about Down Detector's own infra being down, I'm talking about actual legitimate complaints from real users (which is the data that Down Detector collates and displays) because the app they are trying to use on an unaffected cloud is legitimately sending them an error- it's just because of SaaS dependencies and the nature of distributed systems one cloud going down can have a blast radius such that even apps on unaffected clouds will have elevated error rates, and that can end up confusing displays on Down Detector when large enough things go down.

My apps run on AWS, but we use third parties for logging, for auth support, billing, things like that. Some of those could well be on GCP though we didn't see any elevated error rates. Our system is resilient against those being down- after a couple of failed tries to connect it will dump what it was trying to send into a dump file for later re-sending. Most engineers will do that. But I've learned after many bad experiences that after a certain threshold of failures to connect to one of these outside system, my system should just skip calling out except for once every retryCycleTime, because all it will do is add two connectionTimeout's to every processing loop, building up messages in the processing queue, which eventually create backpressure up to the user. If you don't have that level of circuit breaker built, you can cause your own systems to give out higher error rates even if you are on an unaffected cloud.

So today a whole lot of systems that are not on GCP discovered the importance of the circuit breaker design pattern.