Comment by drcongo

4 hours ago

Same. It's weird how I always find out that GitHub is down before GitHub does. It took 15 minutes before it appeared on githubstatus.com.

48 comments

jaapz 4 hours ago

All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before GitHub officially counts it as an outage.
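
To make the "500 errors > baseline for x minutes" rule concrete, here is a minimal Python sketch. The class name, the 1% baseline, and the three-window requirement are illustrative assumptions, not GitHub's actual alerting configuration.

```python
class SustainedErrorRateAlert:
    """Fire only after the 5xx rate exceeds a baseline for N windows in a row."""

    def __init__(self, baseline_rate: float, windows_required: int) -> None:
        self.baseline_rate = baseline_rate        # tolerated fraction of 5xx responses
        self.windows_required = windows_required  # "x minutes", counted in windows
        self.consecutive_breaches = 0

    def observe_window(self, total_requests: int, server_errors: int) -> bool:
        """Record one window's counts; return True if the alert should fire."""
        rate = server_errors / total_requests if total_requests else 0.0
        if rate > self.baseline_rate:
            self.consecutive_breaches += 1
        else:
            self.consecutive_breaches = 0  # one healthy window resets the streak
        return self.consecutive_breaches >= self.windows_required


# Hypothetical one-minute windows: (total requests, 5xx responses).
alert = SustainedErrorRateAlert(baseline_rate=0.01, windows_required=3)
for total, errors in [(1000, 4), (1000, 50), (1000, 60), (1000, 55)]:
    print(alert.observe_window(total, errors))
```

With these numbers the rule stays quiet through the first two bad windows and fires only on the third, so anyone using the service during that stretch sees errors before any alert exists.
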

logifail 2 hours ago
> All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before GitHub officially counts it as an outage.
Is it true that official service status pages are updated automatically?
- baby_souffle 2 hours ago
  
  > Is it true that official service status pages are updated automatically?
  Depends. Typically no because there’s an art to crafting the actual message around impact… but sometimes yes it is automated
hnlmorg 4 hours ago
You'd expect them to be monitoring more than just the HTTP response codes from user requests for precisely this reason.
If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
But effective monitoring is harder than people assume.
- dncornholio 3 hours ago
  
  > If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
  Isn't that what monitoring actually is? The issue seems to be in their testing, not monitoring.
  
echelon 4 hours ago
In a high performance service with good maintenance and upkeep, you page for all 500s. A noisy pager forces the team to fix the 500s.
Maybe the GitHub Actions infrastructure isn't run like that.
edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262
- Doohickey-d 4 hours ago
  
  I'm curious about this, because in my experience (working on smaller services, though), a small number of errors is always there as a "baseline".
  Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
  Which makes me think a small number of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
  
- compumike 3 hours ago
  
  Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:
  If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
  If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
  Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
  
- TheDong 4 hours ago
  
  Do you know of a single service at a single company that actually does that?
  I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and GitHub all do not page on just a single 500.
  I know none of those are particularly "high performance" though. Curious where your experience is coming from.
  
- hvb2 3 hours ago
  
  > A noisy pager forces the team to fix the 500s.
  I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
  What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.
  A 50 year old bank API? Maybe...
- awithrow 4 hours ago
  
  that is absolutely not the case for any system of size and scale. that would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.
  
- rhyperior 3 hours ago
  
  You only do this when you’re trying to use incident management as a hammer to make a point to somebody whom you have otherwise failed to convince to fix something through persuasive argument. I.e., it’s punitive.
- swiftcoder 3 hours ago
  
  Yeah, no, nobody runs cloud services like that. At AWS, most alarms required failures in 3 consecutive 5-minute periods. Critical things could be on 3 consecutive 1-minute windows, but that alarm starts a 15-minute escalation for the oncall engineer to check in, and they have to validate that the issue isn't a false alarm before updating the status page would even be considered.
- jordemort 4 hours ago
  
  forget it, Jake; it’s Azure
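
The split compumike describes above, between "wake me at 3am" and "tell me on Monday", can be sketched as a small routing function. This is only an illustration: `ErrorEvent`, `Action`, `route`, and the default thresholds (3 consecutive failures over 5 minutes) are hypothetical names and values, not the API of heyoncall.com or of any paging product.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE_NOW = "page_now"                        # critical alert, even at 3am
    NOTIFY_NEXT_WORKDAY = "notify_next_workday"  # queue for the next workday


@dataclass
class ErrorEvent:
    source: str                # e.g. "db_health_check" or "form_validation"
    consecutive_failures: int  # failed checks or requests in a row
    minutes_failing: float     # how long the failures have persisted


def route(event: ErrorEvent, min_failures: int = 3, min_minutes: float = 5.0) -> Action:
    """Page only for sustained failures; defer one-off errors to work hours."""
    sustained = (
        event.consecutive_failures >= min_failures
        and event.minutes_failing >= min_minutes
    )
    return Action.PAGE_NOW if sustained else Action.NOTIFY_NEXT_WORKDAY


# Hypothetical events matching the two extremes in the comment.
print(route(ErrorEvent("db_health_check", consecutive_failures=5, minutes_failing=10.0)))
print(route(ErrorEvent("form_validation", consecutive_failures=1, minutes_failing=0.0)))
```

The hard part in practice is choosing the thresholds per signal, which is the "not always easy to distinguish" caveat in the comment; the sketch just hard-codes one pair.
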
registeredcorn 1 hour ago
I'm not arguing with what you're saying, but it does make me wonder: What exactly is the point of the status page, if "it is normal for users to already see errors before GitHub officially counts it as an outage"?
Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at that makes them feel like they are "doing something"? Or is it more a way to save you from having to confirm what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you, Tom? Oh wait, crud. The service page is confirming it's down now. Never mind! Who wants coffee?!"
- filleduchaos 1 hour ago
  
  There is oddly enough a middle ground between "zero errors whatsoever" and "outage".

simonjgreen 4 hours ago

More likely that 'update the Status site' lives a long way down their incident response plan, and they have alarms going off well before that

jordemort 4 hours ago
yeah I mean a company the size of GitHub certainly can’t be expected to have enough staff to walk and chew gum at the same time
- swiftcoder 3 hours ago
  
  If it's like other BigTechs I have worked at, you need director-level signoff and comms team approval to post an outage notice
PunchyHamster 3 hours ago
it should be automatic tho. Probably isn't so they can at least get the one nine on availability
- simonjgreen 2 hours ago
  
  Marketing definitely takes interest in status sites

chrisjj 12 minutes ago

That's the time taken by Head of PR approval.

chrisjj 24 minutes ago

> githubstatus.com

There's a threshold. It shows only once 1000 users complain.

re-thc 4 hours ago

> It's weird how I always find out that GitHub is down before GitHub does

No, it's not. Official updates = potential SLA penalties. Always requires approval.

drcongo 1 hour ago

This is the most plausible reply.