Comment by jldugger
5 hours ago
>Turns out you can compile tens of thousands of patterns and still match at line rate.
Well, yeah, that's sort of the magic of the regular expression <-> NFA equivalence theorem. Any regex can be converted to a state machine, and since you can combine regexes (and NFAs!) procedurally, this is not a surprising result.
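For anyone who hasn't seen the trick, here's a minimal sketch of the union idea using plain Python `re` (the pattern names are made up; Hyperscan/Vectorscan do the same thing with tens of thousands of patterns compiled into one automaton, at far higher throughput):

```python
import re

# Hypothetical patterns; a real Hyperscan database holds tens of thousands.
patterns = {
    "timeout": r"request timed out after \d+ms",
    "oom": r"Out of memory: Killed process \d+",
    "healthz": r"GET /healthz HTTP/1\.[01]",
}

# Union the patterns into a single alternation. Each branch keeps its own
# named group, so one scan also tells you which pattern matched.
combined = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in patterns.items()))

def classify(line: str):
    m = combined.search(line)
    return m.lastgroup if m else None

print(classify("request timed out after 5000ms"))          # -> timeout
print(classify('10.0.0.1 - "GET /healthz HTTP/1.1" 200'))  # -> healthz
```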
> I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.
I'm surprised it's only 40%. Observability seems to be treated like a fire suppression system: all-important in a crisis, but it looks like waste during normal operations.
> The AI can't find the signal because there's too much garbage in the way.
There are surprisingly simple techniques to filter out much of the garbage: compare logs from known good to known bad, and look for the stuff that's strongly associated with bad. The precise techniques seem Bayesian in nature: the more evidence (logs) you collect, the more strongly associated the genuinely bad stuff will appear.
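For concreteness, a toy sketch of what I mean (template names and counts are made up; it's just smoothed log-odds of a template appearing in the bad window vs. the good one):

```python
import math
from collections import Counter

# Toy counts: how often each log template appears in a window of
# known-good traffic vs. a window taken during the incident.
good = Counter({"cache hit": 9000, "request ok": 5000, "conn reset by peer": 3})
bad = Counter({"cache hit": 8800, "request ok": 1200, "conn reset by peer": 450})

good_total, bad_total = sum(good.values()), sum(bad.values())

def association(template: str, alpha: float = 1.0) -> float:
    # Smoothed log-odds that the template belongs to the bad window.
    # More evidence (more log lines) pushes truly bad-associated templates higher.
    p_bad = (bad[template] + alpha) / (bad_total + alpha)
    p_good = (good[template] + alpha) / (good_total + alpha)
    return math.log(p_bad / p_good)

for t in sorted(set(good) | set(bad), key=association, reverse=True):
    print(f"{association(t):+6.2f}  {t}")
```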
More sophisticated techniques will do dimensional analysis -- are these failed requests associated with a specific pod, availability zone, locale, software version, query string, or customer? But you'd have to do so much pre-analysis, prompting, and tool calling that the LLMs that comprise today's AI won't provide any actual value.
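A similarly rough sketch of the dimensional side: for each dimension, how over-represented is each value among failures relative to all traffic (records and field names here are hypothetical):

```python
from collections import Counter

# Hypothetical request records pulled from structured logs or traces.
requests = [
    {"status": 500, "pod": "api-7f", "az": "us-east-1a", "version": "1.42"},
    {"status": 200, "pod": "api-3c", "az": "us-east-1b", "version": "1.41"},
    {"status": 503, "pod": "api-7f", "az": "us-east-1a", "version": "1.42"},
    {"status": 200, "pod": "api-7f", "az": "us-east-1a", "version": "1.41"},
]

def lift(dim: str) -> dict:
    # For each value of a dimension: share among failures divided by share
    # among all traffic. Lift well above 1 points at the suspect value.
    all_counts = Counter(r[dim] for r in requests)
    bad_counts = Counter(r[dim] for r in requests if r["status"] >= 500)
    total_all, total_bad = len(requests), max(sum(bad_counts.values()), 1)
    return {v: (bad_counts[v] / total_bad) / (all_counts[v] / total_all)
            for v in all_counts}

for dim in ("pod", "az", "version"):
    print(dim, lift(dim))
```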
Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Traditional wisdom tells you regex is slow.
> I'm surprised it's only 40%.
Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.
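To illustrate "entire transactions, not individual logs": the usual trick is to make the keep/drop decision a deterministic function of the trace ID, so every line in a transaction shares the same fate. A minimal sketch of that idea (not necessarily how Tero implements it):

```python
import hashlib

def keep_transaction(trace_id: str, sample_rate: float = 0.10) -> bool:
    # Deterministic hash of the trace ID: every log line (and every service)
    # that sees this transaction makes the same keep/drop decision, so you
    # sample whole transactions instead of random individual lines.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000

# Every line tagged with trace "7f3a9c0e" is kept or dropped together.
print(keep_transaction("7f3a9c0e"))
```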
> compare logs from known good to known bad
I think you're describing anomaly detection. Diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.
You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.
> Traditional wisdom tells you regex is slow.
Because it's uncomfortably easy to create catastrophic backtracking.
But just logical-ORing many patterns together isn't one of the ways to do that, at least as far as I'm aware.
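The blowup comes from nested quantifiers plus a forced failure, not from ORing patterns together. A quick demonstration with Python's backtracking engine (timings will vary by machine):

```python
import re
import time

# Classic backtracking bomb: the engine tries every way of splitting the
# a's between the inner and outer +, which is exponential in the input
# length -- each extra 'a' roughly doubles the time.
evil = re.compile(r"^(a+)+$")
text = "a" * 25 + "b"

start = time.perf_counter()
evil.match(text)
print(f"nested quantifiers:  {time.perf_counter() - start:.2f}s")

# A big OR of plain patterns has no nesting to explode; each branch simply
# fails and the engine moves on, so there's no exponential blowup.
union = re.compile("|".join(f"error_{i}" for i in range(10_000)))
start = time.perf_counter()
union.search(text)
print(f"10k-way alternation: {time.perf_counter() - start:.4f}s")
```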
> I think you're describing anomaly detection.
Well, it's in the same neighborhood. Anomaly detection tends to favor finding unique things that only happened once. I'm interested in the highest-volume stuff that only happens on the abnormal-state side. But I'm not sure this has a good name.
> Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever?
I get your point, but if sorting by the most strongly associated logs yields root causes (or at least maximally interesting logs), then shouldn't sorting in the opposite direction yield the toxic waste we want to eliminate?
But if you don't do anomaly detection, how can you possibly know which data is useful for anomaly detection? And thus, which data is valuable to keep?