Comment by xmprt

3 days ago

On the other hand, investing in better tracing tools unlocks a whole other level of logging and debugging capability that isn't feasible with request logs alone. It's kind of like the user-id-as-"trace" idea from your first message, but on steroids.

These tools tend to be very expensive in my experience unless you run your own monitoring stack. Either you end up sampling traces at low rates to save on costs, or your observability bill ends up bigger than your infrastructure bill.

  • We self-host Grafana Tempo, and whilst the cost isn't negligible (we ingest ~50k spans per second), the developer time saved when debugging an error, compared to sifting through and correlating logs, is easily worth an order of magnitude more.

  • Turning on tracing only for clients that saw errors in the last 2 minutes, or for requests that were retried, should capture just a small portion of your data. You can also sample other sessions/requests at random if you want a baseline to compare against (see the first sketch after this list).

  • I like to write my own in every company I'm at, using bash. So I have a local set of bash commands that help me dig through logs and colorize the items I care about (the second sketch after this list shows the colorizing idea).

    Takes some time and it's a pain in the ass initially, but once I've matured them, work becomes so much easier. It reduces dependence on other people / teams / access as well.

    Edit: Thinking about this, they won't work in all use cases. I'm a data engineer, so my jobs are mostly sequential.

  • Try open-source databases specifically designed for traces, such as Grafana Tempo or VictoriaTraces. They can handle ingestion rates of hundreds of thousands of trace spans per second on a regular laptop.
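
A minimal bash sketch of the error-triggered tracing idea above. The log format (ISO-8601 timestamp, client id, HTTP status as the first three whitespace-separated fields) and the file-based hook (a tracing layer that force-samples any client id listed in `force_sample.txt`) are assumptions for illustration, not a real integration:

```bash
#!/usr/bin/env bash
# Sketch: rebuild a "force-sample" list from clients that saw errors recently.
# Assumed log line: <ISO-8601 timestamp> <client_id> <http_status> ...
# Assumed hook: tracing middleware samples 100% of requests whose client_id
# appears in force_sample.txt, plus a low random rate for a baseline.

log=/var/log/app/access.log
out=/etc/tracing/force_sample.txt

# ISO-8601 timestamps compare correctly as plain strings, so awk can
# filter "last 2 minutes" lexically. (GNU date syntax.)
cutoff=$(date -u -d '2 minutes ago' +%Y-%m-%dT%H:%M:%S)

awk -v cutoff="$cutoff" '$1 >= cutoff && $3 >= 500 { print $2 }' "$log" \
  | sort -u > "${out}.tmp" && mv "${out}.tmp" "$out"
```

Run it from cron every minute or so; the atomic `mv` keeps readers from ever seeing a half-written list.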
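
And a minimal sketch of the log-colorizing helper from the bash bullet; the function name and patterns are illustrative, not the commenter's actual tooling:

```bash
# Sketch of a personal log-colorizing helper (drop it in ~/.bashrc).
# Wraps ERROR/WARN lines in ANSI colors so they stand out when scanning.
colorlog() {
  local red=$'\e[31m' yellow=$'\e[33m' reset=$'\e[0m'
  sed -E \
    -e "s/.*ERROR.*/${red}&${reset}/" \
    -e "s/.*WARN.*/${yellow}&${reset}/" \
    "$@"   # with no file arguments, sed reads stdin
}

# Usage:
#   colorlog job.log | less -R
#   tail -f job.log | colorlog
```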