Comment by bell-cot

5 hours ago

Re-sort the takeaway points to put this one first:

> Prioritize human factors. Outage recovery depends on what operators can see and do under stress. When dashboards fail, clear logs, simple commands, and predictable behavior matter more than complex mechanisms.

Why: to make it really, really clear to bullet-skimming managers and complexity-loving engineers that too-clever "solutions", afterthought "testing & training", and poorly documented configurations will turn into worlds of pain when things really go wrong. The "smart people" won't be in the Operations Center then, let alone with all the details fresh in their minds. And several of them may have taken jobs elsewhere and won't much care that the org is desperate for their help right now.