Comment by hnlmorg

3 hours ago

No, monitoring for HTTP response codes is a subset of observability, and not one that generally gives you the best insight into which subsystems are misbehaving or why.

There are synthetic tests, where you generate API request calls or even simulate an entire user journey. These let you control the user agent and the payloads, so you know that anything that errors back is an actual error. They are triggered by the observability platform (think of it like running a cron job), so you're not tied to user activity to see when problems arise.
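As a rough illustration of a synthetic test, here is a minimal sketch of a probe that a scheduler could run on a fixed interval. The user agent string, the URL in the usage note, and the function names are all made up for the example; a real observability platform would provide its own probe runner.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, user_agent: str, timeout: float = 5.0) -> int:
    """Fetch url with a fixed user agent and return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; we still want the code.
        return err.code


def run_probe(
    url: str,
    fetch: Callable[[str, str], int] = http_status,
    user_agent: str = "synthetic-probe/0.1",  # hypothetical identifier
) -> bool:
    """Return True if the probe saw a 2xx response, False otherwise."""
    try:
        status = fetch(url, user_agent)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    return 200 <= status < 300
```

Because the probe controls the request entirely, a `False` here is a real failure rather than a malformed client request. Something like `run_probe("https://api.example.com/health")` (a placeholder URL) would be called from whatever scheduler you already have.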

There are other metrics outside of HTTP response codes too. Think free RAM, CPU usage, disk space, etc. Those are just some obvious ones, because these types of metrics are generally bespoke to the type of application you're monitoring. And with these types of monitors, you'd not just have an alert when things have failed, but ideally also have alerts when an irregular trend shows that things are likely to fail. This latter type of monitor helps you get ahead of the problem before it becomes customer facing.
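To make the "alert on the trend, not just the failure" idea concrete, here is a minimal sketch that fits a straight line to recent free-disk samples and estimates how long until the disk is full. The function names, the sample format, and the 24-hour horizon are assumptions for the example; real platforms usually offer their own forecasting functions.

```python
from typing import Optional, Sequence, Tuple


def hours_until_exhausted(
    samples: Sequence[Tuple[float, float]],
) -> Optional[float]:
    """Estimate hours from the last sample until free space reaches zero.

    samples holds (hours_since_start, free_gb) pairs. Uses an ordinary
    least-squares line; returns None if free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # flat or growing free space: nothing to predict
    intercept = mean_v - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])


def should_warn(
    samples: Sequence[Tuple[float, float]], horizon_hours: float = 24.0
) -> bool:
    """Warn when the fitted trend predicts exhaustion within the horizon."""
    remaining = hours_until_exhausted(samples)
    return remaining is not None and remaining <= horizon_hours
```

A linear fit is the crudest possible model, and it will mislead on bursty or periodic workloads, but it shows how a monitor can fire while the disk still has free space.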

Then you have more traditional stuff like logs. This will also be bespoke to the application, but you'd expect errors in logs to get surfaced quickly, assuming GitHub has good hygiene in what's being logged.

Tie that up with APMs, RUM, and other goodies like that and you'll have diagnostics to investigate issues when they appear.

(this is just a super high level view of observability too)

Even a synthetic probe needs a few failures to trigger an alert.

You should not alert on cpu, ram, etc

  • > Even a synthetic probe needs a few failures to trigger an alert.

    It doesn't "need" that. That's just how most people set it up, because it’s an easy, sane default that allows for network jitter without inexperienced engineers having to think about different conditions triggering different types of responses.

    If you’re measuring internal APIs from an observability solution that has nodes already inside your network enclave, then there is a strong argument for alerting early.

    > You should not alert on cpu, ram, etc

    That’s not true as an absolute statement. As a generalisation, it heavily depends on the system you're monitoring and how it behaves under pressure.

    But in any case, I wasn’t suggesting CPU alerts were the end goal. I said:

    > these types of metrics are generally bespoke to the type of application you're monitoring.

    I.e. you’ll use metrics, but those metrics will be highly specific.

    The CPU examples were an illustration of what a “metric” is (it might seem obvious, but not everyone is an expert). The point was that HTTP response codes aren't the only type of metric one should be capturing and watching.

    • Ah, yes, I misunderstood. And I have seen cases where a direct CPU alert makes sense, but 99 times out of 100 that I see one, it's nothing but trouble. Worse, I tend to see the CPU alert when there are no end-to-end synthetic alerts, 500 alerts, queue depth alerts, etc.

      If your requests are fast and cheap, you can probe frequently relative to your goals, but often that's not really possible (think long SQL queries, or scheduling a container/pod). There you need several datapoints, or possibly fewer augmented with other signals.
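To illustrate the trade-off discussed in this subthread, here is a minimal sketch of an alert rule where the number of consecutive probe failures required is a tunable setting rather than a fixed requirement. The class name and the thresholds of 1 and 3 are assumptions for the example, not recommendations.

```python
class ConsecutiveFailureAlert:
    """Fire once `threshold` probe failures have occurred in a row."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def observe(self, ok: bool) -> bool:
        """Record one probe result and return True if the alert is firing."""
        # Any success resets the streak; any failure extends it.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold


# Probing from inside the network enclave: little jitter, so alert at once.
internal = ConsecutiveFailureAlert(threshold=1)

# Probing across the public internet: tolerate a couple of blips first.
external = ConsecutiveFailureAlert(threshold=3)
```

With slow or expensive probes, a threshold above 1 multiplies the detection delay by the probe interval, which is where combining fewer datapoints with other signals becomes attractive.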
