Comment by da_chicken

12 hours ago

The problem with live patching is twofold.

First, you might not reload everything in memory, so it will be patched on disk but not in process.

Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?

And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.

And, yeah, everything is hot swappable on VAX.

Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.

When someone whose been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machine was often considered a better solution than all-in on one 7- to 8-figure VAX 'mainframe'.

often require a service contract that includes a permanent on site tech.

I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).

Which is moot, because of the system is important enough you'll have an automatic failover to another system running on standby

All this "we must reboot to test" is bullshit excuses by unqualified workers

  • Had an accidental reboot, and it could not boot. Had redundancy, but the other server had failed silently days prior. Solved it with three way redundancy and extra monitoring. Systems fail in many ways at the same time. If you do not test it, there is a chance it wont work. Controlled failure is preferred over unknowns, like rebooting once in a while just to make sure it works.

  • Not sure I'm following honestly. Your primary goes down and it fails over to the secondary (which becomes the primary), but if you can't boot how do you then get another secondary ready to fail over to again when the new primary inevitably fails?

  • Ah, spoken with the confidence of a freshly minted qualified worker :). Anything you don’t test is a wish, not a production system. You either know that your systems work end to end because you tested periodically, or you pray they will.

    How do you know the automatic failover works? How do you know the standby system works?

    I’ve seen many a “qualified workers” getting sent packing because they never fully tested the prod system because they just knew everything will work, and never tested the backup systems because qualified workers do the job right the first time, no need for backup.

>First, you might not reload everything in memory, so it will be patched on disk but not in process.

You design for this with generational tagged objects or something similar.

Yes, some things actually cost money, especially if they aren't easy to implement.