Comment by cjonas

21 hours ago

It would be wild if they dropped below the "two 9's" metric. I think they would need an additional ~16hr of outage in the 90 day rolling period.

https://mrshu.github.io/github-statuses/ suggests that their combined uptime doesn't even meet 1 nine, let alone 2.

  • The intersection of uptime across every possible service they offer isn't a particularly great metric. I get the point that they are doing badly, but it makes it look worse than I think it really is.

    What I would like to see is a combined uptime for "code services", basically Git+Webhooks+API+Issues+PRs, which corresponds to a set of user workflows that really should be their bread & butter, without highlighting things you might not care about (Codespaces, Copilot).

    • Depends how integrated those features are.

      A service's availability is capped by its critical dependencies; this is textbook SRE stuff (see Treynor et al., The Calculus of Service Availability). Copilot may well be on the side of it (and has the worst uptime, dragging everything down), but if Actions depends on Packages then Actions can be "up" while in reality the service is not functional. If your release pipeline depends on Webhooks, then you're unable to release.

      The obvious one is git operations: if you don't have git ops then basically everything is down.

      So; you're right about Copilot, but the subset you proposed (Git+Webhooks+API+Issues+PRs) has the exact same intersection problem. If git is at one nine, that entire subset is capped at one nine too, no matter how green the rest of it looks.

      And to be clear: git operations is sitting at 98.98% on the reconstructed dashboard linked above[1]. That is one nine. Github stopped publishing aggregate numbers on their own status page, which.. tells you something.

      [1]: https://mrshu.github.io/github-statuses/

      1 reply →

  • also I never had considered that breaking your up-time into a bunch of different components is just a strategy to make your SRE look better than it actually is. The combined up-time tells the real story (88%!). Thanks for the link

    • The number of nines assigned to a suite of services is not indicative of the quality of SRE at any given company, but rather a reflection of the tradeoffs a business has decided to make. Guaranteed there's a dashboard somewhere at Github looking at platform stickiness vs. reliability and deciding how hard to let teams push on various initiatives.

      1 reply →

  • ya i was just doing the math on their chart for the git operations. I added up 14.93 hours combined hours, which puts them WAY lower than the reported 99.7 metric they show right next to it.

    So based on their own reporting, the uptime number should be 99.31. Which means only like 6 additional hours and they'd fall below 99.0%