Comment by belter

2 days ago

- You do know their AZs are just firewalled partitions of the same datacenter?

- And they used machines without ECC and their index got corrupted because of it? And instead of hanging their heads in shame and taking lessons from the IBM old-timers, they published a paper about it?

- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...

Don't worry, you'll have plenty more outages from the land of "we only hire the best"...

I wonder if machines without ECC could explain why our apps periodically see TCP streams with scrambled contents.

On GKE, we see different services (like Postgres and NATS) running in separate containers on the same VM receive/send stream contents (e.g. HTTP responses) where packets in the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster; it lasts 2-3 hours and then magically resolves itself, and draining the node also fixes it.

If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.
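
For anyone chasing something similar: a minimal sketch of the kind of end-to-end integrity check that would surface this at the application layer. The length+CRC32 framing and the `send_framed`/`recv_framed` helpers are purely illustrative, not something our services actually run.

```python
import socket
import struct
import zlib

# Hypothetical framing: 4-byte length + 4-byte CRC32, then the payload.
# If anything below the app layer scrambles bytes, the CRC check fails loudly
# instead of the corruption silently reaching the Postgres/NATS clients.

def send_framed(sock: socket.socket, payload: bytes) -> None:
    header = struct.pack("!II", len(payload), zlib.crc32(payload))
    sock.sendall(header + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return buf

def recv_framed(sock: socket.socket) -> bytes:
    length, expected_crc = struct.unpack("!II", recv_exact(sock, 8))
    payload = recv_exact(sock, length)
    actual_crc = zlib.crc32(payload)
    if actual_crc != expected_crc:
        # Log enough context to compare against the sender's copy later.
        raise ValueError(
            f"frame corrupted: crc expected={expected_crc:#010x} "
            f"actual={actual_crc:#010x} len={length}"
        )
    return payload
```

TCP's own 16-bit checksum is weak, and corruption introduced after it's verified (in the kernel, in a proxy, or in memory) sails straight through, so an end-to-end check like this at least tells you which layer the mangling is not happening in.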

  • You could start by running similar parallel infra on AWS, which is ECC everywhere... And also check the corrupted TCP streams for single-bit flip patterns (a quick sketch is below), and maybe correlate the timing with memory pressure, since that's where and when RAM errors typically show up. If it's more than just bit flips, it could be something else.
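
  If you can get hold of both the sender's copy and the mangled copy of the same payload, here's a rough sketch of that bit-flip analysis. `classify_corruption` is a hypothetical helper that just XORs the two buffers and classifies the diffs:

  ```python
  from collections import Counter

  def classify_corruption(expected: bytes, observed: bytes) -> None:
      """Compare a known-good payload against a corrupted capture and report
      whether the differences look like isolated single-bit flips (typical of
      bad RAM) or larger mangled runs (more likely a copy/offset bug)."""
      if len(expected) != len(observed):
          print(f"length mismatch: {len(expected)} vs {len(observed)} bytes "
                "-> looks like interleaved/dropped data, not bit flips")
          return

      diff_offsets = [i for i, (a, b) in enumerate(zip(expected, observed)) if a != b]
      flip_widths = Counter(bin(expected[i] ^ observed[i]).count("1") for i in diff_offsets)

      print(f"{len(diff_offsets)} differing bytes out of {len(expected)}")
      print(f"bits flipped per differing byte: {dict(flip_widths)}")
      if diff_offsets and all(w == 1 for w in flip_widths):
          print("all diffs are single-bit flips: consistent with memory errors")
      else:
          print("multi-bit or byte-run differences: more likely mangled copies")

  # Example with an injected single-bit flip at offset 3:
  good = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
  bad = bytearray(good)
  bad[3] ^= 0x04
  classify_corruption(good, bytes(bad))
  ```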