
Comment by belter

2 days ago

You could start by running similar parallel infra on AWS that is ECC everywhere... And also check the corrupted TCP streams for single-bit flip patterns, and maybe correlate the timing with memory pressure, since that's when and how RAM errors typically show up. If it's more than just bit flips, it could be something else.
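Something like this is what I mean by checking for single-bit flips: a toy Go sketch that XORs the expected bytes against what was actually captured and counts the differing bits per byte (the sample payloads here are made up). One flipped bit per corrupted byte is consistent with RAM errors; multi-bit or multi-byte damage points elsewhere.

```go
// Toy check for the single-bit-flip pattern: XOR expected vs. captured
// bytes and count how many bits differ in each byte.
package main

import (
    "fmt"
    "math/bits"
)

// bitFlips maps each differing byte offset to the number of flipped bits.
func bitFlips(expected, actual []byte) map[int]int {
    flips := make(map[int]int)
    n := len(expected)
    if len(actual) < n {
        n = len(actual)
    }
    for i := 0; i < n; i++ {
        if d := expected[i] ^ actual[i]; d != 0 {
            flips[i] = bits.OnesCount8(d)
        }
    }
    return flips
}

func main() {
    expected := []byte("SELECT * FROM orders")
    captured := []byte("SELECT * FROM oRders") // 'r' -> 'R' differs only in bit 0x20
    for off, n := range bitFlips(expected, captured) {
        fmt.Printf("offset %d: %d bit(s) flipped\n", off, n)
    }
}
```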

It's a lot more than bit flips; we basically see random fragments of other parallel TCP streams.

  • Sounds like a bug in their custom CNI... or the kernel. I would implement checksum validation at the application layer as a troubleshooting measure and try to correlate the errors with other events. Does it happen during their maintenance windows?

    • I've checked against VM migration events and they don't match. I've not actually investigated whether there is any other type of maintenance that could affect nodes. Cluster upgrades would be one, I suppose.

      Hm, checksum validation isn't really possible, since these are protocols outside our control: Postgres, NATS, and various other HTTP-based services.

      But we could write some kind of simple canary process that constantly generates dummy network traffic and checks it for corruption, something like the sketch below.
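      Roughly what I have in mind, as a toy Go sketch: push a random payload through a TCP peer and verify that what comes back hashes to the same value. The `echoAddr` and embedded echo listener are placeholders so the snippet runs standalone; a real canary would dial a pod on another node so the traffic actually crosses the CNI and the suspect network paths, and would retry and alert instead of exiting on I/O errors.

      ```go
      // Toy corruption canary: repeatedly echo a random payload over TCP and
      // verify the returned bytes hash to the same value as the sent bytes.
      package main

      import (
          "crypto/rand"
          "crypto/sha256"
          "io"
          "log"
          "net"
          "time"
      )

      const echoAddr = "127.0.0.1:9999" // placeholder; point at a remote canary pod in practice

      // echoServer stands in for the remote canary peer.
      func echoServer() {
          ln, err := net.Listen("tcp", echoAddr)
          if err != nil {
              log.Fatal(err)
          }
          for {
              conn, err := ln.Accept()
              if err != nil {
                  continue
              }
              go func(c net.Conn) { defer c.Close(); io.Copy(c, c) }(conn)
          }
      }

      func main() {
          go echoServer()
          time.Sleep(200 * time.Millisecond) // give the listener time to start

          for {
              payload := make([]byte, 32*1024)
              if _, err := rand.Read(payload); err != nil {
                  log.Fatal(err)
              }
              want := sha256.Sum256(payload)

              conn, err := net.Dial("tcp", echoAddr)
              if err != nil {
                  log.Fatal(err) // a real canary would retry and alert instead
              }
              if _, err := conn.Write(payload); err != nil {
                  log.Fatal(err)
              }
              echoed := make([]byte, len(payload))
              if _, err := io.ReadFull(conn, echoed); err != nil {
                  log.Fatal(err)
              }
              conn.Close()

              if sha256.Sum256(echoed) != want {
                  // Report the first mismatching offset for later bit-flip analysis.
                  i := 0
                  for i < len(payload) && payload[i] == echoed[i] {
                      i++
                  }
                  log.Printf("CORRUPTION: echoed payload differs, first mismatch at offset %d", i)
              }
              time.Sleep(time.Second)
          }
      }
      ```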