Comment by atombender

2 days ago

I wonder if machines without ECC could perhaps explain why our apps periodically see TCP streams with scrambled contents.

On GKE, we see different services (like Postgres and NATS) running on the same VM in different containers receive/send stream contents (e.g. HTTP responses) where the packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster, and it lasts 2-3 hours and then magically resolves itself; draining the node also fixes it.

If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.

You could start by running similar parallel infra on AWS, which uses ECC everywhere. Also check the corrupted TCP streams for single-bit-flip patterns, and maybe correlate timing with memory pressure, since that's where and when RAM errors would typically show up. If it's more than just bit flips, it could be something else.
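Checking for single-bit-flip patterns could be sketched roughly like this (a hypothetical diagnostic, assuming you can reconstruct the expected payload, e.g. by re-requesting the same resource): XOR the corrupted bytes against the expected ones and count how many bits differ per corrupted byte. If every differing byte is off by exactly one bit, that's consistent with memory errors; long runs of foreign bytes point elsewhere.

```python
import zlib  # not needed here, just showing this is plain stdlib Python


def classify_corruption(expected: bytes, corrupted: bytes) -> dict:
    """Compare two equal-length payloads and classify the damage."""
    diffs = []
    for offset, (e, c) in enumerate(zip(expected, corrupted)):
        if e != c:
            # Popcount of the XOR = number of flipped bits in this byte.
            flipped_bits = bin(e ^ c).count("1")
            diffs.append((offset, flipped_bits))
    single_bit = sum(1 for _, n in diffs if n == 1)
    return {
        "differing_bytes": len(diffs),
        "single_bit_flips": single_bit,
        # True only if *every* corrupted byte differs by exactly one bit.
        "looks_like_bit_flip": len(diffs) > 0 and single_bit == len(diffs),
    }
```

If `looks_like_bit_flip` comes back False and the diffs are contiguous runs of unrelated data, that matches the "fragments of other streams" symptom described below rather than RAM errors.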

  • It's a lot more; we basically see random fragments of other parallel TCP streams.

    • Sounds like a bug in their custom CNI... or the kernel. I would implement checksum validation at the application layer as a troubleshooting measure and try to correlate the errors with other events. Does it happen during their maintenance windows?

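The application-layer checksum validation suggested above could look something like this (a minimal sketch, assuming a simple length-prefixed framing you'd add yourself): the sender frames each message as `[length][CRC32][payload]`, and the receiver recomputes the CRC. This catches corruption that TCP's 16-bit checksum can miss, especially when checksumming is offloaded or bypassed by an SDN layer.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix payload with a 4-byte length and 4-byte CRC32 (network order)."""
    return struct.pack("!II", len(payload), zlib.crc32(payload)) + payload


def unframe(data: bytes) -> bytes:
    """Verify the CRC32 and return the payload, or raise on corruption."""
    length, crc = struct.unpack("!II", data[:8])
    payload = data[8:8 + length]
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    return payload
```

Logging every `ValueError` with a timestamp would then let you correlate corruption events against node memory pressure, maintenance windows, or other cluster events.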