Comment by insanitybit

2 years ago

I work in the SIEM space, which basically involves ingesting massive amounts of data (relatively speaking). A single customer can ingest terabytes a day, or even 10s to 100s of terabytes of data a day. And you want to run near-arbitrary realtime analytics on it + batch analytics on it. It's a fun, difficult problem.

My product's big thing was to extract the data from logs into a graph data structure. The thing is that I've just taken "huge amount of scale + nice, immutable log" and turned it into "huge amount of scale + evil, mutable graph". Building a massive-scale graph data structure that can be mutated over time is... hard. Like, "hope you've been keeping up on your academic papers" hard.

One of the key optimizations I leveraged was to represent the graph as a CRDT. Every Node has a `merge` function that follows CRDT semantics.

This allows me to collapse states together in a way that converges.
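
A minimal sketch of what a per-node `merge` with CRDT semantics might look like. This is not the actual product code: the `Node` fields (`first_seen`, `last_seen`, `edges`) and the choice of min, max, and set union as the merge rules are assumptions made for illustration. The point is only that each field is combined with an operation that is commutative, associative, and idempotent, so states can be collapsed in any order and still converge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node; every field is merged monotonically."""
    node_id: str
    first_seen: int                                        # merged with min
    last_seen: int                                         # merged with max
    edges: frozenset = field(default_factory=frozenset)    # merged with union

    def merge(self, other: "Node") -> "Node":
        # Only states describing the same node can be collapsed together.
        if self.node_id != other.node_id:
            raise ValueError("can only merge states of the same node")
        return Node(
            node_id=self.node_id,
            first_seen=min(self.first_seen, other.first_seen),
            last_seen=max(self.last_seen, other.last_seen),
            edges=self.edges | other.edges,
        )


# Three partial views of the same node, e.g. parsed from different logs.
a = Node("proc-1", first_seen=10, last_seen=20, edges=frozenset({"file-a"}))
b = Node("proc-1", first_seen=5, last_seen=15, edges=frozenset({"file-b"}))
c = Node("proc-1", first_seen=12, last_seen=30, edges=frozenset({"file-a", "ip-1"}))

# Collapsing them in any order or grouping yields the same state.
merged = a.merge(b).merge(c)
```

Because `merge` never discards information, a writer that sees the updates late, twice, or out of order still ends up with the same `merged` state as everyone else.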

Security queries have some interesting properties:

1. They often care about thresholds, meaning that they inherently work well with a lattice (once you've hit a "bad" state you will always want to investigate that state - this is unlike, say, operations where if it "recovers" you can ignore it)

2. They almost always filter out data

These two properties combine nicely. It means that if our alert is a threshold, and our data only 'grows' in one direction (thanks CRDTs), we can reject queries using stale data and not worry about invalidating any caches.
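
One way to read the threshold argument, as a sketch under stated assumptions: the counter, the `merge_count` and `alert_fires` names, and the threshold of 5 are all hypothetical. If a value only grows under merge and the alert is "value >= threshold", then a verdict of "fires" computed from a stale cached value remains correct no matter which updates arrive afterwards.

```python
def merge_count(a: int, b: int) -> int:
    """Max-merge: the merged count is never lower than either input."""
    return max(a, b)


def alert_fires(failed_logins: int, threshold: int = 5) -> bool:
    """Monotone predicate: once True for a count, True for any larger count."""
    return failed_logins >= threshold


stale = 7                   # cached, possibly out-of-date count (hypothetical)
updates = [3, 7, 9, 12]     # states that might arrive later, in any order

# Fold the later updates into the cached state.
fresh = stale
for u in updates:
    fresh = merge_count(fresh, u)

stale_verdict = alert_fires(stale)
fresh_verdict = alert_fires(fresh)
```

The guarantee is one-directional: a stale "fires" can be trusted, while a stale "does not fire" may still flip once newer data is merged in.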
