Comment by PaulHoule

2 days ago

My first take is that you could have 10 TB of logs with just a few unique lines that are actually interesting. So I am not thinking "Wow, what impressive big data you have there" but rather "if you have an accuracy of 1-10^-6 you are still overwhelmed with false positives" or "I hope your daddy is paying for your tokens".

Mendral co-founder and post author here.

I agree with your statement and explained in a few other comments how we're doing this.

tldr:

- Something happens that needs investigating

- Main (Opus) agent makes focused plan and spawns sub agents (Haiku)

- They use ClickHouse queries to grab only relevant pieces of logs and return summaries/patterns

This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs and start doing narrow, focused searches.

  • In my systems, I just go to an error log that gets posted to a Slack channel, then go to the log file and grep for the full message that got dumped to Slack. That gives me everything that happened before and a state dump after. The state dump can be given to a program to tell us if any state errored, and what happened before tells us what the expectation was and what the precise error was. Using an LLM would just be slower and more expensive for this.
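A minimal sketch of the grep-with-context workflow described above, assuming a line-oriented log in which the message posted to Slack appears verbatim. The function name `context_around`, the window sizes, and the sample log lines are hypothetical; on the command line the same idea is roughly `grep -B` / `grep -A`.

```python
from typing import List, Tuple


def context_around(
    lines: List[str], message: str, before: int = 20, after: int = 20
) -> Tuple[List[str], List[str]]:
    """Locate the first line containing `message` and return
    (lines leading up to it, lines following it, e.g. a state dump)."""
    for i, line in enumerate(lines):
        if message in line:
            start = max(0, i - before)
            return lines[start:i], lines[i + 1 : i + 1 + after]
    # Message not found: nothing to show on either side.
    return [], []


# Hypothetical log contents, for illustration only.
log = [
    "10:00:01 INFO start job 42",
    "10:00:02 INFO fetched 3 items",
    "10:00:03 ERROR job 42 failed: timeout",
    "10:00:03 STATE queue=3 retries=2",
    "10:00:03 STATE worker=idle",
]

preceding, state_dump = context_around(log, "job 42 failed: timeout", before=2, after=2)
```

Here `preceding` holds what happened before the error and `state_dump` holds the lines after it, which could then be handed to a separate checker program.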

Yeah this is my experience with logs data. You only actually care about O(10) lines per query, usually related by some correlation ID. Or, instead of searching you're summarizing by counting things. In that case, actually counting is important ;).

In this piece though--and maybe I need to read it again--I was under the impression that the LLM's "interface" to the logs data is queries against ClickHouse. So long as the queries return sensibly limited results, and it doesn't go wild with the queries, that could address both concerns?
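To make the two access patterns above concrete, here is a small sketch over in-memory records rather than a real ClickHouse table. The field names (`correlation_id`, `level`, `msg`), the helper names, and the sample data are all assumptions for illustration; the point is only that one helper returns a capped handful of lines and the other counts exactly.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]


def by_correlation_id(records: List[Record], cid: str, limit: int = 10) -> List[Record]:
    """Return at most `limit` records that share a correlation ID."""
    matches: List[Record] = []
    for record in records:
        if record.get("correlation_id") == cid:
            matches.append(record)
            if len(matches) >= limit:
                break  # keep the result sensibly limited
    return matches


def count_by(records: List[Record], field: str) -> Counter:
    """Summarize by counting values of one field (exact counts, no estimation)."""
    return Counter(record.get(field, "") for record in records)


# Hypothetical records, for illustration only.
records = [
    {"correlation_id": "req-1", "level": "INFO", "msg": "received"},
    {"correlation_id": "req-2", "level": "INFO", "msg": "received"},
    {"correlation_id": "req-1", "level": "ERROR", "msg": "db timeout"},
    {"correlation_id": "req-1", "level": "INFO", "msg": "retrying"},
]

related = by_correlation_id(records, "req-1", limit=2)
levels = count_by(records, "level")
```

Against a real database the cap would live in the query itself, so the model only ever sees the bounded result or the computed counts.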

  • What does O(10) mean?

    • Mathematically, it means that the number of lines read is bounded by 10*M, where M is some constant. So it's basically equivalent to saying that it's O(1).

      I'm guessing the intention was to say "around 10 lines", though it kind of stretches the definition if we're being picky.
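      Spelled out with the usual textbook definition of big O (the symbols $f$, $n_0$, and $M'$ are introduced here for illustration), the equivalence reads:

```latex
% f(n) is O(10) if it is eventually bounded by a constant multiple of 10:
f(n) = O(10)
  \iff \exists\, M > 0,\ n_0 :\ |f(n)| \le M \cdot 10 \quad \text{for all } n \ge n_0 .
% Setting M' = 10M gives a bound by a constant multiple of 1, and conversely
% any bound M' \cdot 1 can be written as (M'/10) \cdot 10, so:
|f(n)| \le M' \cdot 1 \quad \text{for all } n \ge n_0
  \iff f(n) = O(1),
\qquad \text{hence } O(10) = O(1).
```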

    • I normally see that from engineers using "O(x)" as "approximately x" whenever it's clear from context that you're not actually talking about asymptotic complexity.

    • I think the O means order of magnitude. It looks like Big O notation, but O(10) would collapse to O(1) and OP is not talking about efficiency anyway.