Comment by atombender

2 days ago

It's a lot more, we basically see random fragments of other parallel TCP streams.

Sounds like a bug in their custom CNI... or the kernel. I would implement checksum validation in the application layer as a troubleshooting measure and try to correlate the errors with other events. Does it happen during their maintenance windows?

  • I've checked against VM migration events and they don't match. I've not actually investigated whether there is any other type of maintenance that could affect nodes. Cluster upgrades would be one, I suppose.

    Hm, checksum validation isn't really possible since these are protocols out of our control, like Postgres, NATS, and various HTTP-based services.

    But we could write some kind of simple canary process constantly doing dummy network access and detecting corruption.
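
A minimal sketch of what such a canary could look like, assuming we control both ends: a trivial TCP echo server on one node and a prober on another that sends random bytes and compares what comes back. The names `EchoHandler`, `ThreadedEchoServer`, `recv_exact`, `probe`, and `run_canary` are made up for illustration, and the chunk size, chunk count, and interval are arbitrary.

```python
import os
import socket
import socketserver
import time


class EchoHandler(socketserver.BaseRequestHandler):
    """Echo every received byte back to the sender unchanged."""

    def handle(self):
        while True:
            data = self.request.recv(65536)
            if not data:
                break
            self.request.sendall(data)


class ThreadedEchoServer(socketserver.ThreadingTCPServer):
    # One thread per connection, so parallel probes do not block each other.
    allow_reuse_address = True
    daemon_threads = True


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before full payload arrived")
        buf.extend(chunk)
    return bytes(buf)


def probe(host, port, chunks=16, chunk_size=4096, timeout=5.0):
    """Send random chunks over one connection; return how many came back altered."""
    corrupted = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for _ in range(chunks):
            payload = os.urandom(chunk_size)
            sock.sendall(payload)
            # Wait for the full echo before sending the next chunk, so a
            # mismatch can be attributed to a specific chunk.
            if recv_exact(sock, chunk_size) != payload:
                corrupted += 1
    return corrupted


def run_canary(host, port, interval=1.0, iterations=None):
    """Probe repeatedly and print a timestamped line whenever corruption is seen."""
    done = 0
    while iterations is None or done < iterations:
        bad = probe(host, port)
        if bad:
            print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} corrupted chunks: {bad}")
        done += 1
        time.sleep(interval)
```

The server side would be something like `ThreadedEchoServer(("0.0.0.0", 9000), EchoHandler).serve_forever()` (port 9000 is a placeholder), with `run_canary` pointed at it from other nodes. Running several probers in parallel is probably closer to the failure described, since the corruption looks like fragments of other concurrent streams, and the timestamps give something to correlate against node events.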