Comment by lep_qq

2 days ago

This resonates. We run a similar setup (managed K8s + managed DBs) and hit a comparable issue last year with a cloud provider's CNI update that broke pod-to-service networking for 6 hours. The irony is that "managed" services often abstract away the problems you can fix (config, scaling, backups) while exposing you to problems you can't fix (vendor infrastructure bugs, dependency conflicts between their managed components). What helped us:

- Redundancy across failure domains: We now run critical stateful workloads behind connection pooling that can fail over between private and public endpoints. Yes, it's more complexity, but it's complexity we control (rough sketch of the fallback logic after this list).
- Synthetic monitoring for managed services: We probe not just our app but also the managed service endpoints themselves, from multiple network paths. That catches these "infrastructure layer" failures faster (see the probe sketch below).
- Backup connectivity paths: For managed DBs, we keep both the private VPC endpoint and a public (firewalled) endpoint configured. If one breaks, we can switch in minutes via config.
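
To make the first and third points concrete, here's a minimal sketch of what "switch in minutes via config" can look like at the application layer. Everything here is illustrative: the hostnames, env var names, and the choice of Postgres/psycopg2 are assumptions, and in practice you'd usually push this into your pooler's config rather than app code, but the shape is the same.

```python
import os
import psycopg2  # assuming a Postgres-flavored managed DB; any driver works the same way

# Hypothetical endpoints: the private VPC hostname is tried first, the
# firewalled public hostname is the fallback. Both point at the same DB.
ENDPOINTS = [
    os.environ.get("DB_PRIVATE_HOST", "db.internal.example.com"),
    os.environ.get("DB_PUBLIC_HOST", "db.public.example.com"),
]

def connect_with_fallback(dbname: str, user: str, password: str, port: int = 5432):
    """Try each endpoint in order; return the first connection that succeeds."""
    last_error = None
    for host in ENDPOINTS:
        try:
            return psycopg2.connect(
                host=host,
                port=port,
                dbname=dbname,
                user=user,
                password=password,
                connect_timeout=3,   # fail fast so the fallback kicks in quickly
                sslmode="require",   # the public endpoint should never be plaintext
            )
        except psycopg2.OperationalError as exc:
            last_error = exc  # remember the failure, try the next endpoint
    raise last_error

# conn = connect_with_fallback("app", os.environ["DB_USER"], os.environ["DB_PASSWORD"])
```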
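
And for the synthetic monitoring point, a rough sketch of the kind of probe I mean. Targets are placeholders; the idea is a dumb TCP connect check deployed at several vantage points (in-cluster, a VM outside the VPC, a third-party location) so each network path gets exercised, with results shipped to your metrics pipeline instead of printed.

```python
import socket
import time

# Hypothetical managed-service endpoints to probe from each vantage point.
TARGETS = [
    ("db.internal.example.com", 5432),   # managed DB, private endpoint
    ("db.public.example.com", 5432),     # managed DB, public endpoint
    ("kubernetes.default.svc", 443),     # in-cluster service path (only resolves in-cluster)
]

def probe(host: str, port: int, timeout: float = 3.0) -> dict:
    """TCP-connect check: cheap, and it catches infrastructure-layer breakage
    (routing, CNI, firewall) that an app-level health check can miss."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"target": f"{host}:{port}", "ok": True,
                    "latency_ms": round((time.monotonic() - start) * 1000, 1)}
    except OSError as exc:
        return {"target": f"{host}:{port}", "ok": False, "error": str(exc)}

if __name__ == "__main__":
    for host, port in TARGETS:
        print(probe(host, port))
```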

The DaemonSet workaround is... alarming. It's essentially asking you to run production-critical infrastructure code from an untrusted source because their managed platform has a known bug with no ETA. Your point about trading failure modes is spot on. Managed services are still worth it for small teams, but the value prop is "fewer incidents," not "no incidents," and when incidents do happen, your MTTR is bounded by vendor response time instead of by your own team's skills. Did DO at least provide the DaemonSet from an official source, or was it literally "here's a random GitHub link"?