Comment by aluzzardi

2 days ago

Mendral co-founder and post author here.

I agree with your statement, and I've explained in a few other comments how we're doing this.

tldr:

- Something happens that needs investigating

- The main (Opus) agent makes a focused plan and spawns sub-agents (Haiku)

- The sub-agents run ClickHouse queries to grab only the relevant pieces of the logs and return summaries/patterns

This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs and start doing narrow, focused searches.
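One of those narrow, focused searches could be sketched as a small query builder a sub-agent calls with its slice of the plan. This is only an illustration of the idea, not Mendral's actual code; the `logs` table and its `ts`/`service`/`message` columns are hypothetical:

```python
from datetime import datetime, timedelta

def build_focused_query(service: str, pattern: str, around: datetime,
                        window_minutes: int = 15, limit: int = 200) -> str:
    """Build a ClickHouse query that pulls only what a sub-agent needs:
    one service, one error pattern, a tight time window around the
    incident. (In real code you'd parameterize instead of f-strings.)"""
    start = around - timedelta(minutes=window_minutes)
    end = around + timedelta(minutes=window_minutes)
    return (
        "SELECT ts, message FROM logs "
        f"WHERE service = '{service}' "
        f"AND message ILIKE '%{pattern}%' "
        f"AND ts BETWEEN '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND '{end:%Y-%m-%d %H:%M:%S}' "
        f"ORDER BY ts LIMIT {limit}"
    )

# Search 30 minutes around the incident, not the whole 10 TB.
query = build_focused_query("checkout", "connection timeout",
                            datetime(2024, 5, 1, 12, 30))
```

The point is the shape: each sub-agent's query is bounded by service, pattern, time window, and row limit, so it returns a summary-sized result instead of raw log volume.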

In my systems, I just go to an error log that gets posted to a Slack channel, then go to the log file and grep for the full message that got dumped to Slack. That gives me everything that happened before and a state dump after. The state dump can be fed to a program that tells us whether any state errored, and what happened before tells us what the expectation was and what the precise error was. Using an LLM would just be slower and more expensive for this.
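That grep workflow boils down to slicing a context window around the matched message, like `grep -B`/`-A`. A minimal sketch (the log lines and state-dump format here are made up for illustration):

```python
def context_around(lines, needle, before=50, after=50):
    """Return (lines before the first match, lines after it):
    the 'before' slice shows what the expectation was, the
    'after' slice captures the state dump."""
    for i, line in enumerate(lines):
        if needle in line:
            return lines[max(0, i - before):i], lines[i + 1:i + 1 + after]
    return [], []

# Hypothetical log excerpt with the error message pasted from Slack.
log = [
    "request accepted id=42",
    "validating payload",
    'ERROR: payload schema mismatch',
    'state dump: {"id": 42, "stage": "validate"}',
]
before, after = context_around(log, "ERROR: payload schema mismatch")
```

The `before` slice is then read for the expectation, and the `after` slice (the state dump) is handed to a checker program, no LLM in the loop.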