Comment by aluzzardi
2 days ago
Mendral co-founder and post author here.
I agree with your statement and explained in a few other comments how we're doing this.
tldr:
- Something happens that needs investigating
- The main (Opus) agent makes a focused plan and spawns sub-agents (Haiku)
- The sub-agents use ClickHouse queries to grab only the relevant pieces of logs and return summaries/patterns
This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs and start doing narrow, focused searches.
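A minimal sketch of that fan-out, assuming the shape described above. All names here (`plan_queries`, `investigate`, `run_clickhouse`) are illustrative stand-ins for the Opus planner, the Haiku sub-agents, and a real ClickHouse client; the query runner is stubbed so the sketch runs standalone:

```python
from concurrent.futures import ThreadPoolExecutor

def run_clickhouse(query: str) -> list[tuple]:
    # Stub standing in for a real ClickHouse client (e.g. clickhouse-connect);
    # returns canned rows so the sketch is self-contained.
    return [("error", 42), ("warn", 7)]

def plan_queries(incident: str) -> list[str]:
    # Stand-in for the main (Opus) agent: turn one incident into a few
    # narrow, focused queries instead of scanning all logs.
    return [
        f"SELECT level, count() FROM logs WHERE message ILIKE '%{incident}%' GROUP BY level",
        f"SELECT service, count() FROM logs WHERE message ILIKE '%{incident}%' GROUP BY service",
    ]

def investigate(query: str) -> str:
    # Stand-in for a Haiku sub-agent: run one focused query and return
    # a short summary rather than raw log lines.
    rows = run_clickhouse(query)
    return f"{len(rows)} groups matched: {rows[:3]}"

def triage(incident: str) -> list[str]:
    # Fan out: one sub-agent per planned query, summaries come back
    # to the main agent for the final analysis.
    queries = plan_queries(incident)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(investigate, queries))
```

The point of the structure is that only summaries flow back up; the 10 TB of raw logs never enter the main agent's context.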
In my systems, an error log gets posted to a Slack channel; I then go to the log file and grep for the full message that was dumped to Slack. That gives me everything that happened before, plus a state dump after. The state dump can be fed to a program to tell us whether any state errored, and the preceding context tells us what the expectation was and what the precise error was. Using an LLM would just be slower and more expensive for this.
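A sketch of that grep step, on a made-up log file (the log contents and error string are invented for illustration): `-B` pulls the lines before the match (the expectation), `-A` pulls the state dump after it.

```shell
# Build a small sample log so the example is self-contained.
cat > app.log <<'EOF'
state: ok
request 41 start
ERROR: order 7 failed
state dump: {"order": 7, "status": "failed"}
request 42 start
EOF

# Grep for the exact message from the Slack alert:
# -B2 = 2 lines of context before, -A1 = 1 line after (the state dump).
grep -B2 -A1 'ERROR: order 7 failed' app.log
```

The before-context and after-context windows would be tuned to however much lead-up and state dump your logs actually emit.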