Comment by jeffbee

4 hours ago

Given that this happens on the main loop, one wonders how this escaped release quality checks. Is there another contributing cause that narrows the impact?

compumike 3 hours ago

I don’t see any other impact-narrowing criteria.

The default kubelet `syncFrequency` is 1 minute (per Pod). (There can be additional event-driven syncs, but this is a floor.) Back of the envelope: 30 Pods * 1440 minutes/day * 7 days ≈ 300k calls to `startPodSync` in a week.
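
That arithmetic can be spelled out as a quick sketch. It assumes exactly one periodic sync per Pod per `syncFrequency` interval and ignores event-driven syncs; the function name `periodic_syncs` is only illustrative and is not anything in the kubelet.

```python
def periodic_syncs(pods: int, sync_frequency_s: int, days: int) -> int:
    """Lower bound on periodic syncs: one per Pod per sync interval.

    Assumption: event-driven syncs are ignored, so the real count is higher.
    """
    seconds = days * 24 * 60 * 60
    return pods * (seconds // sync_frequency_s)


# 30 Pods, 60-second syncFrequency (the default cited above), one week
weekly = periodic_syncs(pods=30, sync_frequency_s=60, days=7)
print(weekly)  # 302400
```

So "300k" is the rounded form of 302,400, and it only goes up once event-driven syncs are counted.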

I’d guess a lot of production clusters don’t run the latest release of Kubernetes, and 1.36 was only released in late April, less than two months before I reported the issue.

Given that the issue shows up as linear memory growth on the order of maybe 1 GiB/month per node, there really is a time-based discovery process here. (Could be artificially accelerated for a CI test, I imagine!) And any restarted nodes or clusters that upgraded point releases would restart that clock.
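
A rough sketch of what that linear growth implies, under loudly stated assumptions: a 30-day month, the same 30 Pods with one sync per minute each, and a leak of exactly 1 GiB/month (the figure above is only "on the order of maybe"). The helper names are hypothetical, and the per-call number is an estimate derived from those assumptions, not a measurement.

```python
GIB = 2**30


def leaked_bytes(uptime_days: float, gib_per_month: float = 1.0,
                 days_per_month: int = 30) -> float:
    """Linear model: leaked memory depends only on time since the last restart."""
    return gib_per_month * GIB * uptime_days / days_per_month


def implied_bytes_per_sync(pods: int = 30, syncs_per_pod_per_day: int = 1440,
                           gib_per_month: float = 1.0,
                           days_per_month: int = 30) -> float:
    """Average leak per periodic sync implied by the assumed monthly rate."""
    calls_per_month = pods * syncs_per_pod_per_day * days_per_month
    return gib_per_month * GIB / calls_per_month


# A node restarted a week ago has leaked under a quarter of a GiB,
# which is why a restart or point-release upgrade resets the clock.
print(leaked_bytes(uptime_days=7) / GIB)
print(implied_bytes_per_sync())
```

Under these assumptions the implied leak is on the order of several hundred bytes per sync, small enough to go unnoticed unless a node stays up for weeks.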

The only unusual contributing cause that made the issue more visible to me was trying to run everything on a tiny 2 GiB RAM node!