Comment by silvestrov
14 hours ago
A Danish bank found out that this can bite you in the ass.
When you hotpatch a system for years, you have no idea whether it can still boot or whether it will fail somewhere in the boot process.
i.e. you can only trust what you regularly test.
Mainframes can LPAR dynamically. When you want to test if your production system will IPL cleanly, you clone your production environment to an isolated LPAR and IPL it. No impact to production and you get your test.
US telcos as well.
There were several switch failures in the 1980s / 1990s in which systems that had been upgraded in place without a full restart failed. (IIRC, one burnt down, literally.)
Engineers were uncertain as to whether or not a cold-boot restart was even possible.
Account concerning an AT&T system upgrade sourcing Risks Digest (Vol 9, Issue 62, February 26, 1990) by the recently deceased Peter G. Neumann: <https://telephoneworld.org/landline-telephone-history/the-cr...>.
Interesting, is there any public info on the case?
Not doubting it, only curious about some kind of postmortem.
In Danish: https://danskebank.com/da/news-og-insights/nyhedsarkiv/press...
or translated: https://danskebank-com.translate.goog/da/news-og-insights/ny...
TL;DR: the power supply failed completely, and DB2 then failed while running recovery operations due to multiple old, pre-existing software bugs.
Thanks for hunting it down.