
Comment by JsonDemWitOster

10 hours ago

The Google SRE book offers the following as one of the reasons to not gun for 100% reliability (emphasis added):

> users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!
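To put rough numbers on the quoted claim, here is a minimal sketch that assumes the device and the service fail independently and sit in series, so their availabilities multiply. The function names and the 99% device figure (taken from the quote) are illustrative, not anything from the SRE book's own tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of unavailability per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


device = 0.99  # the "99% reliable smartphone" from the quote

four_nines = end_to_end(device, 0.9999)
five_nines = end_to_end(device, 0.99999)

print(f"end-to-end with 99.99% service:  {four_nines:.5%}")
print(f"end-to-end with 99.999% service: {five_nines:.5%}")
print(f"service-only downtime: {downtime_minutes_per_year(0.9999):.2f} vs "
      f"{downtime_minutes_per_year(0.99999):.2f} min/year")
print(f"device-only downtime:  {downtime_minutes_per_year(device):.0f} min/year")
```

Under these assumptions the end-to-end figures come out at roughly 98.9901% and 98.99901%, so the extra nine is swamped by the device's own downtime. The model says nothing about the case where the user can tell which component failed.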

I've had a shaky relationship with my ISP of late. What brought me to this thread today is that I couldn't push to GitHub. Notably, this isn't covered by their downtime report, so going by the available facts it's _probably_ not GitHub's fault I couldn't push; I was also just on my daily stand-up call and kept getting disconnected.

But looking beyond today's available facts, odds are there's a bigger problem GH is not mentioning on their status page. They say the current incident has to do with "unauthorized users" and I wonder if pushing a commit from my IDE client counts as an operation from an "unauthorized user", as I still have to authorize with my SSH key.

It's just insane that I can't decide which of GitHub or German o2 should be the more reliable service!

GitHub isn't having a debate over how many 9s they have; they're having a zero 9s problem.

I think there are 3 big themes with this, though not

1. LLM tools have added considerable load.

2. LLMs used by developers to increase velocity seem to be leading to more outages. This calls the increased velocity into question.

3. Roadmaps are focused on pushing things that aren't reliability work, e.g. GitHub moving to Azure, or adding AI features.

All these same problems happen to orgs with other fads that aren't AI. Following fads is not good engineering.

"unauthorized" is a bit different than "unauthenticated". The former suggests trying to access something you don't have permission for while the latter suggests you're just not logged in.
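HTTP draws the same line, with somewhat confusing names: status 401 carries the reason phrase "Unauthorized" but is the response for a request with missing or invalid credentials (unauthenticated), while 403 "Forbidden" is for a caller who is known but not permitted. A minimal sketch of that distinction follows; `access_status` and its parameters are hypothetical names for illustration, and this is not a claim about how GitHub decides access.

```python
from http import HTTPStatus
from typing import Optional, Set


def access_status(user: Optional[str], allowed_users: Set[str]) -> HTTPStatus:
    """Map an access attempt to a status code (hypothetical helper)."""
    if user is None:
        # No identity established: unauthenticated, reported as 401.
        return HTTPStatus.UNAUTHORIZED
    if user not in allowed_users:
        # Identity known but lacking permission: 403.
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK


collaborators = {"alice"}
print(access_status(None, collaborators))     # not logged in
print(access_status("bob", collaborators))    # logged in, no permission
print(access_status("alice", collaborators))  # logged in, permitted
```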

At a guess, I could imagine some sort of failure of cached pages, which can be cached for signed out users but probably not for signed in users (as the rendered HTML would need to have user context like their avatar etc)
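As a sketch of that guess (purely illustrative, with made-up names like `AnonymousPageCache` and `render_page`, and no knowledge of GitHub's actual caching layer): pages for signed-out visitors are identical for everyone and can be served from a shared cache, while pages for signed-in users embed per-user context and are rendered every time.

```python
from typing import Callable, Dict, Optional


class AnonymousPageCache:
    """Caches rendered pages for signed-out visitors only (hypothetical)."""

    def __init__(self, render: Callable[[str, Optional[str]], str]) -> None:
        self._render = render
        self._pages: Dict[str, str] = {}
        self.render_calls = 0  # counts how often the backend had to render

    def get(self, path: str, user: Optional[str] = None) -> str:
        if user is not None:
            # Signed-in pages carry user context (avatar etc.), so skip the cache.
            self.render_calls += 1
            return self._render(path, user)
        if path not in self._pages:
            self.render_calls += 1
            self._pages[path] = self._render(path, None)
        return self._pages[path]


def render_page(path: str, user: Optional[str]) -> str:
    """Stand-in renderer: includes the user's name when signed in."""
    who = user if user is not None else "guest"
    return f"<html><body>{path} for {who}</body></html>"


cache = AnonymousPageCache(render_page)
cache.get("/repo")           # rendered once, then cached
cache.get("/repo")           # served from cache
cache.get("/repo", "alice")  # always rendered
cache.get("/repo", "alice")  # always rendered
print(cache.render_calls)
```

With a design like this, a fault in the cached path would hit only signed-out traffic, and a fault in rendering would hit signed-in users on every request, which is one way the two groups could see different failures.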

Apparently GitHub is experiencing a huge increase in usage due to LLMs, and this is the cause of a lot of their recent instability.

> Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!

Sure they can. If Google loads and GitHub doesn't, then it's clearly GitHub being down, not the mobile network.

Also, not everyone uses a phone. My desktop & fibre internet have way better than 99% reliability.