Comment by lforster

5 hours ago

They're using Cloudflare for multicloud, but still have Cloudflare as a single point of failure. They should make a Cloudflare for Cloudflare to solve this.

6 comments

nexttk 5 hours ago

Like the infamous "smiling through the pain" meme:

"I added a load-balancer to improve system reliability" (happy)

"Load balancer crashed" (smiling-through-the-pain)

PunchyHamster 5 hours ago

Reliability has a very weird curve, frankly.
Technically, a multi-node cluster with failover (or full-on active-active) will have far higher uptime than just a single node.
Practically, getting a multi-node cluster (for any non-trivial workload) to work right, reliably, and fail over in every case is far more work and far more code (which can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. For example, we recently had an uncorrectable memory error that happened to hit the Ceph daemon in just the right way that one of the OSDs misbehaved and bogged down the entire cluster...
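
A rough back-of-the-envelope sketch of that curve, assuming node failures are independent and treating the failover machinery as one more component with its own availability. The function name and every number here are made up for illustration, not measurements:

```python
def cluster_availability(node_availability: float, nodes: int,
                         failover_availability: float = 1.0) -> float:
    """Availability of a cluster where any one healthy node is enough.

    Assumes node failures are independent, which real incidents
    (one misbehaving daemon bogging down everything) often violate.
    failover_availability is a hypothetical figure for the failover
    logic itself; the cluster only counts as up when that works too.
    """
    all_nodes_down = (1.0 - node_availability) ** nodes
    return failover_availability * (1.0 - all_nodes_down)


# Illustrative numbers only.
single = cluster_availability(0.99, 1)
ideal = cluster_availability(0.99, 3)
buggy = cluster_availability(0.99, 3, failover_availability=0.98)
```

With these made-up numbers the ideal three-node cluster comes out around 0.999999, but once the failover layer itself only works 98% of the time, the cluster lands below the single node's 0.99.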

amalcon 3 hours ago

You jest, but this actually does exist. Multiple CDNs sell multi-CDN load balancing (dividing traffic between 2+ CDNs per variously complicated specifications, with failover) as a value-add feature, and IIRC there is at least one company for which this is the marquee feature. It's also relatively doable in-house as these things go.

kevin_thibedeau 3 hours ago

Failover to Akamai.

cortesoft 3 hours ago

As someone who has worked for a CDN for over a decade, this is what most big customers do. Under normal circumstances, they send portions of traffic to different CDNs, usually based on cost (and/or performance in various regions). When an issue happens, they will pull traffic from the problem CDN.
Of course, if a big incident happens at a big CDN, there might not be enough latent capacity in the other CDNs to take all the traffic. CDNs are a cutthroat business with small margins, so there usually isn’t a TON of unused capacity lying around.
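
A minimal sketch of both halves of that, assuming a simple weighted split plus a health set. The CDN names, weights, and Gbps figures are hypothetical, and a real setup would typically do this at the DNS or traffic-steering layer rather than per request in application code:

```python
import random


def pick_cdn(weights: dict[str, float], healthy: set[str],
             rng: random.Random) -> str:
    """Pick a CDN by weight, skipping any that are marked unhealthy."""
    candidates = {name: w for name, w in weights.items()
                  if name in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy CDN available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


def overflow_after_failover(load: dict[str, float],
                            capacity: dict[str, float],
                            failed: str) -> float:
    """Traffic left with nowhere to go once `failed` is pulled."""
    spare = sum(capacity[n] - load[n] for n in load if n != failed)
    return max(0.0, load[failed] - spare)


# Hypothetical split and capacities, in Gbps.
weights = {"cdn_a": 0.6, "cdn_b": 0.3, "cdn_c": 0.1}
load = {"cdn_a": 600.0, "cdn_b": 300.0, "cdn_c": 100.0}
capacity = {"cdn_a": 800.0, "cdn_b": 400.0, "cdn_c": 150.0}

# Only cdn_b is healthy here, so it is the only possible choice.
choice = pick_cdn(weights, healthy={"cdn_b"}, rng=random.Random(0))
# Pulling the biggest CDN leaves more traffic than the others can absorb.
stranded = overflow_after_failover(load, capacity, failed="cdn_a")
```

In this made-up example, losing the smallest CDN is absorbed completely, while losing the largest one leaves 450 Gbps that the remaining two cannot take.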

MichaelZuo 5 hours ago

If there’s clearly a single point of failure, shouldn’t it be called a single cloud pretending to be “multicloud”?