Comment by m3047

3 days ago

I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." but the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over.

"Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug parts logging in order to emphasize that it also may include signals pertaining to which code paths were taken, how many times, and how long it took.

Logging is not metrics is not auditing. In particular processing can continue if logging (temporarily) fails but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics".

In mature SCADA systems there is the well-worn notion of a "historian". Read up on it.

A fluid level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or a bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly answer my desired evaluative (will I get there or not?).
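To make the distinction concrete, here is a hypothetical sketch (all names and numbers invented) of collapsing the raw 10 Hz stream into the evaluative: de-noise the readings, then answer the one question that matters.

```python
# Hypothetical sketch: turning a noisy 10 Hz fuel-level stream into the
# evaluative "will I get there?". Units and constants are illustrative.

def smooth(readings, alpha=0.2):
    """Exponential moving average to de-noise raw sensor readings."""
    level = readings[0]
    for r in readings[1:]:
        level = alpha * r + (1 - alpha) * level
    return level

def can_reach(readings_litres, km_per_litre, km_remaining):
    """The evaluative: does the de-noised level cover the distance left?"""
    usable = smooth(readings_litres)
    return usable * km_per_litre >= km_remaining
```

The firehose of raw readings stays local (where it is useful for diagnosing a stuck sensor); only the evaluative needs to travel further.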

Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud, not logging / observables; when something goes sideways, the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway.

> Logging is not metrics is not auditing.

I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are).

How these signals are stored, transformed, queried, and presented may differ, but at the end of the day, the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like.
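As an illustration of the lossless-conversion claim, here is a sketch (the schema is invented for the example) of a metric embedded inside a structured log event and recovered intact by a downstream consumer:

```python
import json

# Sketch of the "streams of bits" claim: a metric embedded in a structured
# log event can be recovered losslessly by a downstream processor.
# Field names are illustrative, not any particular logging schema.

def emit_log(event, metrics):
    """One structured event carrying both narrative fields and metrics."""
    return json.dumps({"event": event, "metrics": metrics})

def extract_metrics(log_line):
    """A consumer that treats the same stream as time-series data."""
    record = json.loads(log_line)
    return record.get("metrics", {})

line = emit_log({"request_id": "abc", "status": 200},
                {"duration_ms": 12.5, "bytes_out": 4096})
assert extract_metrics(line) == {"duration_ms": 12.5, "bytes_out": 4096}
```

The same archived log line can later be re-read as a metric stream, an audit record, or a plain log entry, because no fidelity was lost at ingestion.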

  • Auditing is fundamentally different because it has different durability and consistency requirements. I can buffer my logs, but I might need to transact my audit.

    • For most cases, buffering audit logs on local storage is fine. What matters is that the data is available and durable somewhere in the path, not that it be transactionally durable at the final endpoint.

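A minimal sketch of the buffer-my-logs / transact-my-audit distinction (file paths and record formats are made up): the audit write is forced to stable storage before the caller is acknowledged, while log writes may sit in an in-memory buffer.

```python
import os

# Illustrative only: logs buffer in memory and flush in batches; audit
# records are fsync'd to local storage before we acknowledge the caller.

log_buffer = []

def write_log(line):
    log_buffer.append(line)  # durable "eventually", flushed in batches

def write_audit(path, line):
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # durable *before* the caller is acknowledged
```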

  • Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost.

    If you have insufficient ingestion rate:

    Logs are for events that can be independently sampled and still be coherent. You can drop arbitrary logs to stay within ingestion rate.

    Traces are for correlated sequences of events where the entire sequence needs to be retained to be useful/coherent. You can drop arbitrary whole sequences to stay within ingestion rate.

    Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity.

    If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want.
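The three degradation strategies above can be sketched as follows (event shapes are invented for illustration): drop individual log events, drop whole traces by id, or aggregate up front.

```python
import random
from collections import defaultdict

# Sketch of the three ways to degrade under an ingestion budget.
# Event shapes are illustrative.

def sample_logs(events, keep_fraction, rng):
    """Logs: drop arbitrary individual events; survivors stay coherent."""
    return [e for e in events if rng.random() < keep_fraction]

def sample_traces(events, keep_fraction, rng):
    """Traces: keep or drop whole sequences by trace_id, never split one."""
    trace_ids = sorted({e["trace_id"] for e in events})
    kept = {t for t in trace_ids if rng.random() < keep_fraction}
    return [e for e in events if e["trace_id"] in kept]

def pre_aggregate(events):
    """Metrics: aggregate up front -- fixed emission rate, fidelity lost early."""
    counts = defaultdict(int)
    for e in events:
        counts[e["name"]] += 1
    return dict(counts)
```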

    • > If you have insufficient ingestion rate

      I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery once the backpressure has resolved itself, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though.
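The separated-streams / priority-queueing idea can be sketched like this (priority values are made up): buffered items drain in priority order once backpressure resolves, with arrival order preserved within a priority.

```python
import heapq

# Illustrative spool: items buffered during backpressure drain by
# (priority, arrival order). The priority assignments are invented.

AUDIT, METRIC, LOG = 0, 1, 2  # lower number drains first

class SpoolBuffer:
    def __init__(self):
        self._heap, self._seq = [], 0

    def push(self, priority, item):
        heapq.heappush(self._heap, (priority, self._seq, item))
        self._seq += 1  # tie-breaker preserves arrival order

    def drain(self):
        while self._heap:
            yield heapq.heappop(self._heap)[2]
```

Note that prioritization changes only the delivery order, not the nature of the system: everything buffered is still delivered.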

    • Good summary IMO.

      > You can drop arbitrary logs to stay within ingestion rate.

      Another way I've heard this framed in production environments ingesting a firehose is: you can drop individual logging events because there will always be more.
