Comment by waych

5 hours ago

Live patching production kernels makes sense when there is an imminent threat or deadline and reboots are throttled by mechanisms that guard the health of the distributed systems running atop those machines. Here's a real example I am familiar with:

Consider a hyper-converged cluster with many nodes serving distributed block storage, say at 3x replication. This can tolerate exactly one node being out at a time for the reboot. It would seem preferable to drain the nodes first, which would allow more parallelism in the per-node kernel-reboot process, but draining is expensive, and it's cheaper to reboot and hope the data comes back to the pool within some period of time after the reboot. With reboots effectively serialized, total rollout time grows linearly with cluster size.
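To put rough numbers on that linear growth, here is a back-of-envelope sketch. The function name and the 20 minutes per node (reboot plus waiting for storage to heal) are my own illustrative assumptions, not measurements from any real cluster:

```python
def rollout_hours(nodes: int, minutes_per_node: float, max_parallel: int = 1) -> float:
    """Estimate wall-clock hours to reboot every node, max_parallel at a time.

    Hypothetical model: every batch takes the same time (reboot + heal),
    and nothing pauses the rollout.
    """
    batches = -(-nodes // max_parallel)  # ceiling division
    return batches * minutes_per_node / 60.0


# Assumed figure: 20 minutes per node, one node at a time.
for n in (50, 500, 2000):
    print(f"{n} nodes: {rollout_hours(n, minutes_per_node=20):.0f} hours")
```

Under those assumptions 500 nodes already take about a week of uninterrupted rebooting, and that is before any pauses.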

A cluster of non-trivial size facing this can easily have a reboot rollout stretch from hours into days and even weeks. It gets slower still when the rollout is repeatedly paused because some other production issue is detected, or because some other in-cluster event has left distributed storage health degraded or unavailable. If a single additional node goes out during the reboot rollout, data becomes unavailable and storage must wait and heal. It also simply takes time for the cluster to reconcile when a node's storage comes back from reboot and to verify that the data is all still there.

If your systems are large enough, the rollout goes so slowly that you fall into a trap where the target release changes mid-deployment, so as to pick up everything learned in the intervening days or weeks: security, performance, crashes, whatever! There is a benefit, because the fixes you cared about most got onto a portion of the cluster sooner rather than later. There is also a penalty, as this resets the deployment clock, lengthening the perceived end-to-end deployment time. This negatively affects OKRs and likewise delays anything that was queued for upcoming releases.
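A tiny sketch of the trap, under simplifying assumptions of mine: the rollout proceeds at a constant rate, and every new release retargets it from scratch. The function name and the day counts are hypothetical:

```python
def coverage_before_retarget(rollout_days: float, release_interval_days: float) -> float:
    """Fraction of the cluster a release reaches before the next one retargets the rollout.

    Hypothetical model: constant rollout rate, restart from zero on each release.
    """
    if rollout_days <= 0:
        raise ValueError("rollout_days must be positive")
    return min(1.0, release_interval_days / rollout_days)


# Assumed figures: a 21-day rollout with a new release every 14 days.
print(f"{coverage_before_retarget(21, 14):.0%} of nodes get each release")
```

Whenever the rollout takes longer than the release interval, the result stays below 100%, which is to say no single release ever lands on the whole cluster.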

So yeah, live patching is great for getting priority fixes out in a matter of minutes or hours. I also think it is the best tool for getting oneself out of this rollout-reset trap and onto the next release sooner. Faster than rollback or rollover.
