Comment by eddythompson80
18 hours ago
SRE agents are the worst agents. I totally get why business and management will demand them and love them. After all, they are the n+1 of the customer support chatbot that you get frustrated talking to before you find the magic way to get to a person.
We have been using a few different SRE agents and they all fucking suck. The way they are prompted and run always makes them eager to “please” by inventing processes, services, and workarounds that don’t exist or make no sense. Giving examples will always sound petty or “dumb”. Every time I have to explain to management where the SRE agent failed, they just hand-wave it away and assume it’s a small problem. And the problem is, I totally get it. When the SRE agent says “DNS propagation issues are common. I recommend flushing the DNS cache or trying again later” or “The edge proxy held a bad cache entry. The cache will eventually get purged and the issue should be solved eventually”, it sounds so reasonable and “smart”. The issue was in DNS or in the proxy configuration. How smart was the SRE agent to get there? They think it’s phenomenal, and it may be. But I know that the “DNS issue” isn’t gonna resolve itself, because we have a bug in how we update DNS. I know the edge proxy cache issue is always gonna cause a particular use case to fail, because the way cache invalidation is implemented has a bug. Everyone loves deflection (including me) and “self-correcting” systems. But it just means that a certain class of bugs will forever be “fine”, and maybe that’s fine. I don’t know anymore.
That’s my experience working with most SRE humans too. They’re more than happy to ignore the bug in DNS and build a cron job to flush the cache every day instead.
So in some sense the agent is doing a pretty good job…
I have no personal experience with SRE agents, but I used Codex recently when trying to root-cause an incident after we'd put in a stopgap. It did the last-mile debugging of looking through the code for me once I had assembled a set of facts and log lines, and it accurately pointed me to some code I had ignored in my mental model because it was so trivial I didn't think it could be an issue.
That experience made me think we're getting close to SRE agents being a thing.
And as the LLM makers like to reiterate, the underlying models will get better.
Which is to say, I think everyone should have some humility here, because how useful these systems end up being is very uncertain. This of course applies just as much to execs who are ingesting the AI hype.
I guess that depends on how you use agents (SRE or in general). If you ask it a question (even implicitly) and blindly trust the answer, I agree. But if you have it help you find the needle in the haystack, and then verify that it did indeed find the needle, suddenly it’s a powerful tool.
Have you used Amazon Q? It's actually pretty handy at investigating, diagnosing, and providing solutions for AWS issues. For some reason none of our teams use it, and instead they waste their time googling or opening tickets for me to answer. I go to Q and ask it, it provides the answer, I send it back to the user. I don't think an "SRE Agent" will be useful because it's too generic, but an "Agent customized to solve problems for one specific product/service/etc" can actually be very useful.
That said, I think you're right that you can't really replace an Operations staff, as there will always need to be a human making complex, multi-dimensional decisions in constantly changing scenarios to keep a business operational.