Comment by nirb89
15 hours ago
Hey all,
I'm the author of the blog post. I'm honestly enjoying the discussion this is generating (including the less flattering comments here). I'll try to address some of the assumptions I've seen; hopefully that clears a few things up.
First off - some numbers. We're a near real-time cybersecurity platform, and we ingest tens of billions of raw events daily from thousands of different SaaS endpoints. Additionally, a significant subset of our customers are quite large (think Fortune 500 and up). For the engine, that means a few things:
- It was designed to be dynamic by nature, so that both out-of-the-box and user-defined expressions evaluate seamlessly.
- Schemas vary wildly - there are thousands of them, received from external sources, often with little documentation.
- A matching expression needs to trigger an alert immediately, as these alerts are critical to business safety (no use flagging a breached account a day later).
- Endpoints change and break on a near-weekly basis, so being able to update expressions on the fly is integral to the process, and should not require changes by the dev team.
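To make the hot-swap requirement above concrete, here is a minimal Go sketch (my own illustration with a hypothetical `Rule`/`RuleStore` - not gnata's actual API): compiled expressions live behind an RWMutex-guarded map, so rules can be replaced at runtime without a deploy while evaluation continues on other goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// Rule is a hypothetical compiled expression: it reports whether a raw
// event (decoded JSON) matches. In the real system this would be a
// compiled JSONata expression; here it is a plain predicate.
type Rule func(event map[string]any) bool

// RuleStore holds the active rule set and allows swapping rules at
// runtime, per the "update expressions on the fly" requirement.
type RuleStore struct {
	mu    sync.RWMutex
	rules map[string]Rule
}

func NewRuleStore() *RuleStore {
	return &RuleStore{rules: make(map[string]Rule)}
}

// Update installs or replaces a rule while evaluation continues elsewhere.
func (s *RuleStore) Update(name string, r Rule) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rules[name] = r
}

// Match returns the names of all rules that fire for the event.
func (s *RuleStore) Match(event map[string]any) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var hits []string
	for name, r := range s.rules {
		if r(event) {
			hits = append(hits, name)
		}
	}
	return hits
}

func main() {
	store := NewRuleStore()
	store.Update("breached-account", func(e map[string]any) bool {
		return e["severity"] == "critical"
	})
	fmt.Println(store.Match(map[string]any{"severity": "critical", "user": "alice"}))
	// prints: [breached-account]
}
```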
Now to answer some questions:
- Why JSONata: others have mentioned it here, but it is a fantastic and expressive framework with a very detailed spec. It fits naturally into a system that is primarily NOT maintained by engineers, but instead by analysts and end-users that often have little coding expertise.
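For context, here is an illustrative JSONata expression of the kind an analyst might author (my example, not one from the post) - a path with a filter predicate, followed by an object constructor that reshapes each match:

```jsonata
payload.events[severity = "critical" and source = "okta"].{
  "user": actor.email,
  "ip": client.ipAddress
}
```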
- Why not a pre-existing library: believe me, we tried that first. None of them actually matched the reference spec reliably. We tried multiple Go, Rust and even Java implementations. They all broke on multiple existing expressions, and were not reliably maintained.
- Why JSON at all (and not a normalized pipeline): we have one! Our main flow is much more of a classic ELT, with strongly-defined schemas and distributed processing engines (i.e. Spark). It ingests quite a lot more traffic than gnata does, and is obviously more efficient at scale. However, we have different processes for separate use-cases, as I suspect most of the organizations you work at do as well.
- Why Go and not Java/JS/Rust: well, because that's our backend. The rule engine is not JUST for evaluating JSONata expressions. There are a lot of layers involving many aspects of the system, one of which is gnata. A matching event must pass through all these layers before it even gets to the evaluation part. Unless we rewrote our backend in JS, no other language would really have mitigated the problem.
Finally, regarding the $300k/year cost (which many here seem to be horrified by) - it seems I wasn't clear enough in the blog. 200 pods was not the entire fleet, and it was not statically set. It was a single cluster at peak time. We have multiple clusters, each with their own traffic patterns and auto-scaling configurations. The total cost was $25k/month when summed as a whole.
Being slightly defensive here, but that really is not that dramatic a number when you take into account the business requirements to get such a flexible system up and running (with low latency). And yes, it was a cost sink we were aware of, but as others have mentioned - business ROI is just as important as pure dollar cost. It is a core feature that our customers rely on heavily, and changing its base infrastructure was neither trivial nor cost-effective in human-hours. AI completely changed that, and so I took it as a challenge to see how far it could go. gnata was the result.
To me, the odd part is when you compare the performance of RPC vs inline code. You present it as if you found something new and foundational, only possible thanks to AI, when in fact, it has nothing to do with AI, and the results should be no surprise to anyone.
Your original architecture was a kludge to start with - a self-inflicted wound. This is probably the craziest part:
> We’d tried a few things over the years - optimizing expressions, output caching, and even embedding V8 directly into Go (to avoid the network hop).
I know hindsight is 20/20 - but still, you made the wrong decision at the start, and then you kept digging the hole deeper and deeper. Hopefully a good lesson for everyone working with microservices.
To end on a more positive note, I think this (porting code to other languages/platforms) is one use case where AI code generation really shines, and it will be of immense value in the future. Great reporting - let's just not confuse code generation with architectural decisions.
Oh, I don't disagree. The original vision and what the product ended up doing are light years apart. Likely, had we known what it would evolve into, we would have decided on a different solution (perhaps not JSONata at all, for example).
Having said that, my opinion is still that the previous solution had valid business merit. Though inefficient, the fact that it was infinitely scalable - the only limit being pure dollar cost - is pretty valuable. It lets business stakeholders and managers objectively quantify the value of the feature (for X dollars we get Y business, scaling linearly). I've worked on many systems where this was not at all the case, and there was a hard limit at some point beyond which the feature simply shut down.
> Finally, regarding the $300k/year cost (which many here seem to be horrified by) - it seems I wasn't clear enough in the blog. 200 pods was not the entire fleet, and it was not statically set. It was a single cluster at peak time. We have multiple clusters, each with their own traffic patterns and auto-scaling configurations. The total cost was $25k/month when summed as a whole.
So, then, what do you estimate the actual savings of the transition to be, taking into account only the component in question and its actual resource needs? (i.e. not simply projecting based on a linear multiple of peak utilization).
I'm going to be a little harsh here, and please forgive me: intellectual dishonesty, especially when the hard numbers are easily determinable, is something I've denied engineers promotions for. It's genuinely impressive that you've saved the company money, but $500k/year projected from peak utilization is a very different number than, say, $100k/year in resources actually saved.
200 pods was peak allocation on a specific cluster, not total sustained cost for all of prod. The savings come from quite literally comparing last month's cloud bill with the new one, after all optimizations were applied and resources were aligned.
I appreciated the writeup and your clarification.
I wonder whether this was your first attempt to solve this issue with LLMs, or whether you had tried before and this was the time you finally felt they were good enough for the job. Did you try making this switch earlier, for example last year when Claude Code was released?
Honestly, I was very averse to agentic coding up until Opus came out. The hallucinations, and the false confidence in objectively wrong answers, just broke more things than they fixed.
However, once it came out, it suddenly behaved close to how they marketed it. So this was my first real end-to-end project with AI in the front seat. Design-wise it is nowhere near perfect, and I was holding its hand the entire way through.