Comment by alexfoo

29 minutes ago

One team can't troubleshoot AND FIX every possible subsystem, so you just end up with lots (growing to hundreds) of people "on-call" anyway.

As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).

Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.

Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.