Comment by belter

1 day ago

Sounds like a bug in their custom CNI... or the kernel. I would implement checksum validation in the application layer as a troubleshooting measure and try to correlate the errors with other events. Does it happen during their maintenance windows?

I've checked against VM migration events and they don't match. I haven't actually investigated whether any other type of maintenance could affect nodes. Cluster upgrades would be one, I suppose.

Hm, checksum validation isn't really possible since these are protocols out of our control, like Postgres, NATS, and various other HTTP-based services.

But we could write some kind of simple canary process that constantly makes dummy network requests and checks the responses for corruption.
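
A rough sketch of what such a canary might look like, assuming plain TCP and an echo peer we run ourselves on another node. The names `start_echo_server` and `probe_once` are made up for illustration, and the payload is kept small so a single send fits in the socket buffers. It sends random bytes, reads them back, and compares SHA-256 digests:

```python
import hashlib
import os
import socket
import socketserver
import threading


class EchoHandler(socketserver.BaseRequestHandler):
    """Peer side of the canary: sends back whatever it receives."""

    def handle(self):
        while True:
            chunk = self.request.recv(65536)
            if not chunk:
                break
            self.request.sendall(chunk)


def start_echo_server(host="127.0.0.1", port=0):
    """Start the echo peer in a background thread (port 0 = pick a free port)."""
    server = socketserver.ThreadingTCPServer((host, port), EchoHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def recv_exactly(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full payload was echoed")
        buf.extend(chunk)
    return bytes(buf)


def probe_once(host, port, size=16 * 1024, timeout=5.0):
    """Send `size` random bytes and return True if the echo matches.

    The payload is deliberately small; larger payloads would need
    interleaved send/receive to avoid both sides blocking on full buffers.
    """
    payload = os.urandom(size)
    expected = hashlib.sha256(payload).hexdigest()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        echoed = recv_exactly(sock, size)
    return hashlib.sha256(echoed).hexdigest() == expected
```

In practice the echo peer would run on one node and `probe_once` would be called in a loop from another, logging a timestamp on every mismatch so the failures can be lined up against maintenance or upgrade events.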