Expensively Quadratic: The LLM Agent Cost Curve

3 days ago (blog.exe.dev)

> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another: with clever data center design, clever hardware and software engineering, clever algorithmic improvements, and/or clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret. Nothing that can put competitors at a disadvantage will get published any time soon.

There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache; no idea if any of them work. I also saw this on HN recently as an intro to them: https://arxiv.org/abs/2512.24601

  • What do you think about RLMs? At first blush they look like sub-agents with some sprinkles on top, but people who have become more adept with them seem to show that they can achieve sublinear context scaling very effectively.

> Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.

disagree with this: IMO the primary reason these still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase and end up hitting way too many files, overloading the context.

Otherwise lots of interesting stuff in this article! Having a precise calculator was very useful for thinking about how many things we should be putting into an agent loop to hit a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.
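
A rough sketch of the kind of calculator I mean, with made-up per-token prices (the real numbers depend on the provider); the point is just how cache reads over a growing prefix turn the loop into a quadratic bill:

```python
# Toy agent-loop cost model. All prices are illustrative placeholders,
# not any provider's actual rates.
PRICE_INPUT = 3.00 / 1_000_000    # $/token, uncached input (assumed)
PRICE_CACHED = 0.30 / 1_000_000   # $/token, cache read (assumed 10x cheaper)
PRICE_OUTPUT = 15.00 / 1_000_000  # $/token, output (assumed)

def loop_cost(turns: int, new_in: int = 1_000, out: int = 500) -> float:
    """Cost of a loop where every turn re-reads the whole prefix from cache."""
    prefix, total = 0, 0.0
    for _ in range(turns):
        total += prefix * PRICE_CACHED   # re-read everything so far
        total += new_in * PRICE_INPUT    # fresh tool output / user prompt
        total += out * PRICE_OUTPUT
        prefix += new_in + out           # the prefix keeps growing
    return total

for n in (10, 50, 100):
    print(f"{n} turns: ${loop_cost(n):.2f}")
# The cache-read term is a sum over a growing prefix, so total cost grows
# roughly with turns^2 -- the "expensively quadratic" curve.
```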

  • I think that's reasonable, but then there should be a way for the agent to override it on the next call, even if it requires the agent to have read the file once or something.

    In the absence of that you end up with what several of the harnesses ended up doing, where an agent will use a million tool calls to very slowly read a file in 200-line chunks. I think they _might_ have fixed it now (or my agent harness might be fixing it for me), but Codex used to do this and it made it unbelievably slow.

    • You’re describing peek.

      An agent needs to be able to peek before determining “Can I one shot this or does it need paging?”
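
      Something like this (a hypothetical tool; the names and the threshold are made up) would cover both: a cheap peek by default, truncation past a threshold, and an explicit override on a later call instead of paging in 200-line chunks:

      ```python
      from pathlib import Path

      MAX_LINES = 2_000  # assumed truncation threshold

      def read_file(path: str, peek: bool = False, force_full: bool = False) -> str:
          """Hypothetical harness tool: peek -> header plus line count,
          default -> truncated read, force_full -> the whole file in one call."""
          lines = Path(path).read_text(errors="replace").splitlines()
          if peek:
              return f"{len(lines)} lines total. First 20:\n" + "\n".join(lines[:20])
          if len(lines) > MAX_LINES and not force_full:
              return ("\n".join(lines[:MAX_LINES])
                      + f"\n[truncated {len(lines) - MAX_LINES} lines; "
                        "call again with force_full=true to read everything]")
          return "\n".join(lines)
      ```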

  • > when you run a grep command in a large codebase and end up hitting way too many files, overloading context.

    On the other hand, I despise that it automatically pipes things through output-limiting commands like `grep` with a filter, `head`, `tail`, etc. I would much rather it try to read the full grep output and then decide to filter down from there if the output is too large -- that's exactly what I do when I do the same workflow I told it to do.

    Why? Because piping through output-limiting commands can hide the scope of the "problem" I'm looking at. I'd rather see that scope first so I can decide if I need to change from a tactical view/approach to a strategic one. It would be handy if the agents could do the same thing -- and I suppose they could if I were a little more explicit about it in my tool/prompt.

    • In my experience this is what Claude 4.5 (and 4.6) basically does, depending on why it's grepping in the first place. It'll sample the header, do a line count, etc. This is because the agent can't backtrack mid-'try to read the full file': if you put the 50,000 lines into the context, they are now in the context.

the quadratic curve makes sense but honestly what kills us more is the review cost - AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. we burn more time auditing AI output than we save on writing it, and that compounds. the API costs are predictable at least

  • > then you're stuck reading every line because it might've missed some edge case or broken something

    This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.

    Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, a commit/merge filter, or whatever you want) and curated by humans. Initial creation by the AI based on user stories, but test modifications go through a PR process and are scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review to focus only on looking for reward hacks.
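
    One cheap way to do the "protected tests" part, as a sketch (the paths and the sign-off mechanism are assumptions, not any particular CI product): fail the pipeline whenever a change touches test files without an explicit human sign-off, so test edits always land in front of a reviewer.

    ```python
    # Sketch of a CI gate: block diffs that touch protected test paths unless
    # a human has explicitly signed off (here via a marker passed by the CI job).
    import subprocess
    import sys

    PROTECTED_PREFIXES = ("tests/", "e2e/")  # assumed repo layout

    def changed_files(base: str = "origin/main") -> list[str]:
        out = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                             capture_output=True, text=True, check=True).stdout
        return [f for f in out.splitlines() if f]

    def main() -> int:
        touched = [f for f in changed_files() if f.startswith(PROTECTED_PREFIXES)]
        approved = "TESTS-REVIEWED" in sys.argv  # hypothetical sign-off marker
        if touched and not approved:
            print("Test files changed without human sign-off:", *touched, sep="\n  ")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())
    ```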

    • Tests are not free, and over-proliferation of AI-touched tests is itself a problem, similar to the problem of duplicative and verbose AI-generated code.

      And tests are inherently imperfect: they may not test the right layer, so they break when they shouldn't, and they certainly don't capture every premise.

      I'm on board with the tactics you suggest, but they are only incrementally helpful. What we really need is AI that removes duplicative code and unnecessary tests.

  • > AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep

    If the abstraction that the code uses is "right", there will be hardly any edge cases, nor anything to break three layers deep.

    Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in these AI models, but in the programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.

    • > programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.

      Is there a typo here? If they don't care about code why would they reject code based on quality?

    • I mean in theory yes, good abstractions solve a lot - but in practice you're rarely starting from a clean slate. you're integrating with third-party APIs that have weird edge cases, working with legacy code that wasn't designed for what you're doing now, dealing with requirements that change mid-implementation. even with great abstractions the real world bleeds through. and AI doesn't know which abstractions are 'right' for your specific context, it just pattern-matches what looks reasonable. so you end up reviewing not just for bugs but to make sure it's not subtly incompatible with your architecture

  • > AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep

    I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only those folders that need changing. Let it build the project: if it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first, if not the main, focus in LLM PRs.
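
    Concretely, a gate roughly like this (the build/test commands and the allowed folders are placeholders for a hypothetical project, not any existing tool):

    ```python
    # Sketch of a pre-PR gate for agent-authored changes: the diff must stay
    # inside an allow-list of folders, the project must build, and the tests
    # must pass; otherwise no PR gets opened.
    import subprocess

    ALLOWED_DIRS = ("src/feature_x/", "tests/feature_x/")  # hypothetical scope

    def run(cmd: list[str]) -> bool:
        return subprocess.run(cmd).returncode == 0

    def gate() -> bool:
        diff = subprocess.run(["git", "diff", "--name-only", "origin/main...HEAD"],
                              capture_output=True, text=True).stdout.splitlines()
        if any(not f.startswith(ALLOWED_DIRS) for f in diff if f):
            print("change escapes the allowed folders"); return False
        if not run(["make", "build"]):  # placeholder build command
            print("build failed"); return False
        if not run(["make", "test"]):   # placeholder test command
            print("tests failed"); return False
        return True

    if __name__ == "__main__":
        raise SystemExit(0 if gate() else 1)
    ```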

    I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.

    • yeah test-driven constraints help a lot - we've been moving that direction too, basically treating the agent like a junior dev who needs guard rails. the build+test gates catch the obvious stuff. but the trickier part is when tests pass but the code still isn't what you wanted - like it works but takes a fundamentally wrong approach, or adds unnecessary complexity. those are harder to catch with automation. re: LLM vs AI terminology - fair point, though I think the ship has sailed on general usage. most people just say AI to mean 'the thing that writes code' regardless of what's under the hood

    • > I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.

      I think you have that backwards.

      The resource and copyright concerns stem from any of these "AI" technologies which require a training phase. Which, to my knowledge, is all of them.

      LLMs are just the main targets because they are the most used. Diffusion models have the same concerns.

  • What surprises me is that this obvious inefficiency isn't competed out of the market. I.e., this is clearly such a suboptimal use of time, and yet lots of companies do it and don't get competed out by the ones that don't.

    • I think the issue is everyone's stuck in the same boat - the alternative to using AI and spending time reviewing is just writing it yourself, which takes even longer. so even if it's not a net win, it's still better than nothing. plus a lot of companies aren't actually measuring the review overhead properly - they see 'AI wrote 500 lines in 2 minutes' and call it a productivity win without tracking the 3 hours spent debugging it later. the inefficiency doesn't get competed out because everyone has the same constraints and most aren't measuring it honestly

  • To eliminate this tax I break anything gen-AI does into the smallest chunks possible.

  • I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.

    Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

    Bugs that the AI agent would write, I would have also written. Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.

    I also find that the AI writes more bug-free code than I did. It handles cases that I wouldn’t have thought of. It uses best practices more often than I did.

    Maybe I was a bad dev before LLMs but I find myself producing better quality applications much quicker.

    • > Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.

      I don't understand: how can you not fault the AI for generating code that can't handle unexpected data gracefully? Expectations should be defined, input validated, and anything that's unexpected should be rejected. Resilience against poorly formatted or otherwise nonsensical input is a pretty basic requirement.

      I hope I severely misunderstood what you meant to say because we can't be having serious discussions about how amazing this technology is if we're silently dropping the standards to make it happen.
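
      For what it's worth, the bar here isn't exotic; a boundary check along these lines (field names are made up) is usually enough to turn "unexpected data" into an explicit rejection rather than a silent bug:

      ```python
      from dataclasses import dataclass

      @dataclass
      class OrderRequest:  # hypothetical input shape
          quantity: int
          currency: str

      def parse_order(raw: dict) -> OrderRequest:
          """Validate at the boundary: reject anything outside expectations."""
          try:
              qty = int(raw["quantity"])
              cur = str(raw["currency"]).upper()
          except (KeyError, TypeError, ValueError) as exc:
              raise ValueError(f"malformed order payload: {exc}") from exc
          if qty <= 0:
              raise ValueError("quantity must be positive")
          if cur not in {"USD", "EUR"}:
              raise ValueError(f"unsupported currency: {cur}")
          return OrderRequest(quantity=qty, currency=cur)
      ```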

    • > Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

      This is likely the future.

      That being said: "I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library."

      If you are spending a lot of time fixing syntax, have you looked into linters? If you are spending too much time thinking about how to structure the code, how about spending some days coming up with some general conventions, or simply using existing ones?

      If you are getting so much productivity from LLMs, it is worth checking if you were simply unproductive relative to your average dev in the first place. If that's the case, you might want to think, what is going to happen to your productivity gains when everyone else jumps on the LLM train. LLMs might be covering for your unproductivity at the code level, but you might still be dropping the ball in non-code areas. That's the higher level pattern I would be thinking about.

    • You have way more trust in test suites than I do. How complex is the code you’re working with? In my line of work most serious bugs surface in complex interactions between different subsystems that are really hard to catch in a test suite. Additionally in my experience the bugs AI produces are completely alien. You can have perfect code for large functions and then somewhere in the middle absolutely nonsensical mistakes. Reviewing AI code is really hard because you can’t use your normal intuitions and really have to check everything meticulously.

  • The review cost problem is really an observability problem in disguise.

    You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically. If an AI agent generates code that passes your existing test suite, linter, and type checker, you've reduced the review surface to "does this do what I asked" rather than "did it break something."

    The teams I've seen succeed with coding agents treat them like a junior dev with commit access gated behind CI. The agent proposes, CI validates, human reviews intent not implementation. The ones struggling are the ones doing code review line-by-line on AI output, which defeats the purpose entirely.

    The real hidden cost isn't the API calls or the review time - it's the observability gap. Most teams have no idea what their agents are actually doing across runs. No cost-per-task tracking, no quality metrics per model, no way to spot when an agent starts regressing. You end up flying blind and the compounding costs you mention are a symptom of that.
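
    A minimal sketch of the kind of tracking I mean, with made-up field names; the point is just that every agent run emits a record you can aggregate per task and per model:

    ```python
    import json, time
    from dataclasses import dataclass, asdict

    @dataclass
    class AgentRunRecord:  # hypothetical schema
        task_id: str
        model: str
        input_tokens: int
        cached_tokens: int
        output_tokens: int
        wall_seconds: float
        tests_passed: bool
        cost_usd: float

    def log_run(record: AgentRunRecord, path: str = "agent_runs.jsonl") -> None:
        """Append one run to a JSONL file; a warehouse or dashboard works the same way."""
        with open(path, "a") as f:
            f.write(json.dumps({"ts": time.time(), **asdict(record)}) + "\n")

    # Later: group by task_id and model to spot cost regressions, or a model
    # that quietly starts failing tests more often than it used to.
    ```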

    • > You should have test coverage, type checking, and integration tests that catch the edge cases automatically.

      You should assume that if you are going to cover edge cases, your tests will be tens to hundreds of times as big as the code under test. It is the case for several database engines (MariaDB has 24M of C++ in the sql directory and 288M of tests in the mysql-test directory), and it was the case when I developed a VHDL/Verilog simulator. And not everything can be covered with type checking; many things, but not all.

      AMD had hundreds of millions of test cases for its FPU, and formal modeling caught several errors [1].

      [1] https://www.cs.utexas.edu/~moore/acl2/v6-2/INTERESTING-APPLI...

      SQLite used to have 1,100 LOC of tests per LOC of C code; the multiplier is smaller now, but it is still big.

    • > You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically.

      Because tests are always perfect, catch every corner case, and even detect unusual behaviour they weren't written to test for? Seems unrealistic. But it explains the sharp rise of AI slop and self-inflicted harm.

    • That's a lovely idea but it's just not possible to have tests that are guaranteed to catch everything. Even if you can somehow cover every single corner case that might ever arise (which you can't), there's no way for a test to automatically distinguish between "this got 2x slower because we have to do more work and that's an acceptable tradeoff" and "this got 2x slower because the new code is poorly written."

    • I'd absolutely want to review every single line of code made by a junior dev because their code quality is going to be atrocious. Just like with AI output.

      Sure, you can go ahead and just stick your head in the sand and pretend all that detail doesn't exist, look only at the tests and the very high level structure. But, 2 years later you have an absolutely unmaintainable mess where the only solution is to nuke it from orbit and start from scratch, because not even AI models are able to untangle it.

      I feel like there are really two camps of AI users: those who don't care about code quality and implementation, only intent, and those who care about both. And for the latter camp, it's usually not because they are particularly pedantic personalities, but because they have to care about it. "Move fast and break things" webapps can easily be vibe coded without too much worry, but there are many systems which cannot. If you are personally responsible, in monetary and/or legal aspects, you cannot blame the AI for landing you in trouble, just as a carpenter cannot blame his hammer for doing a shit job.

Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability, but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern for you".

You'd think cost is an easy talking point to help people care but the starting points for people are so heterogeneous that it's tough to show them they can take control of this measurement themselves.

I say the latter because the article is a point-in-time snapshot: without a recurrent observation around this, some aspects may change radically depending on the black-box implementations of the integrations they depend on (or even on their pricing strategies).

[1] https://ai-evals.io/

The brain trims its context through forgetting details that do not matter.

LLMs will have to eventually cross this hurdle before they become our replacements.

what i've learned running multi-agent workflows...

- use the expensive models for planning/design and the cheaper models for implementation
- stick with small/tightly scoped requests
- clear the context window often and let the AGENTS.md files control the basics
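
roughly what that routing looks like in code (the model names are placeholders, and call_llm is assumed to be whatever client wrapper you already have):

```python
# sketch: route by phase, not by task -- the expensive model thinks,
# the cheap model types. model names are placeholders.
PLANNING_MODEL = "big-expensive-model"
IMPLEMENTATION_MODEL = "small-cheap-model"

def pick_model(phase: str) -> str:
    return PLANNING_MODEL if phase in {"plan", "design", "review"} else IMPLEMENTATION_MODEL

def run_task(task: str, call_llm) -> str:
    """call_llm(model, prompt) -> str is assumed to exist in your harness."""
    plan = call_llm(pick_model("plan"),
                    f"Break this into small, tightly scoped steps:\n{task}")
    results = []
    for step in plan.splitlines():
        if step.strip():
            # fresh, small context per step; AGENTS.md carries the standing rules
            results.append(call_llm(pick_model("implement"), step.strip()))
    return "\n".join(results)
```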

  • there’s something of a paradox there. Reduce the context window and work on smaller/tightly scoped requests? Isn’t the whole value proposition that I can work much faster? To do that, I naturally try to describe what I want at a higher, vaguer level.

    • That's where something like openspec and beads come in. You work high level, create a spec and break it down into beads (small tasks). Your main agent then spawns workers that perform a task with limited scope.

The cache gets read at every token generated, not at every turn of the conversation.

  • Depends on which cache you mean. The KV cache gets read on every token generated, but the prompt cache (which is what incurs the cache-read cost) is read once per request, when the prompt is processed.

I'm not sure, but I think cached read costs are not the most accurately priced. If you consider your costs to be what you pay when consuming an API endpoint, then sure, the answer will be 50k tokens. But if you consider how much it costs the provider, cached tokens probably have a way higher margin than the (probably negative) margin on input and output inference tokens.

Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will go down. In essence you really don't reprocess the input tokens at inference: if you own the hardware, it's quite trivial to infer one output token at a time with no additional cost. If you have 50k input tokens and you generate 1 output token, it's not like you have to "reinfer" the 50k input tokens before you output the second token.
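
Conceptually that's just keeping attention state keyed by the prompt prefix; a toy sketch of the idea (not any provider's actual mechanism):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: store per-prefix attention (KV) state under a hash of
    the token prefix, so a repeated prefix skips the prefill instead of being
    reprocessed from scratch."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens: list[int]) -> str:
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def get(self, tokens: list[int]):
        return self._store.get(self._key(tokens))  # None on a cache miss

    def put(self, tokens: list[int], kv_state) -> None:
        self._store[self._key(tokens)] = kv_state
```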

To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example by implementing a custom logit_bias mechanism client-side), so you can infer 1 output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) 0-cost cached tokens, you can do it.

So, bottom line: cached input tokens are almost free naturally (unless you hold them for a long period of time); the price of cached-input APIs is probably due to the lack of API negotiation as to which inputs you want to cache. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, to almost 0. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.

  • While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.

    > the time it takes to generate the Millionth output token is the same as the first output token.

    This is not true, even if you have the kv cache "hot" in vram; that's just not how transformers work. Each new token still attends over everything before it, so later tokens cost more compute than early ones.

    > cached input tokens are almost virtually free naturally

    No, they are not in practice. There are pure engineering considerations here: how you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.

    Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.

    Now consider 100k users doing basically this, all day long. This is not free and can't become free.
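
    A back-of-the-envelope illustration of the first point, with made-up model sizes: the attention work for each decoded token grows with how much context already sits in front of it (the MLP part stays constant per token, so it eventually gets dwarfed).

    ```python
    # Rough per-token attention cost, ignoring MLPs and projections (which do
    # not grow with context). Sizes are illustrative, not any real model's.
    N_LAYERS, N_HEADS, HEAD_DIM = 80, 64, 128

    def attn_flops_for_token(position: int) -> int:
        # per layer, per head: QK^T over `position` keys plus the attention-weighted
        # sum over `position` values, roughly 2*position*HEAD_DIM FLOPs each
        return N_LAYERS * N_HEADS * (2 * position * HEAD_DIM + 2 * position * HEAD_DIM)

    print(attn_flops_for_token(1))          # an early token
    print(attn_flops_for_token(1_000_000))  # a very late token: ~10^6x the attention work
    ```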

  • GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.

  • Caching might be free, but I think making caching cost nothing at the API level is not a great idea either considering that LLM attention is currently more expensive with more tokens in context.

    Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong about the scale of the effect here though.

  • This matches my experience running coding agents at scale. The cached token pricing is indeed somewhat artificial - in practice, for agent workflows with repeated context (like reading the same codebase across multiple tasks), you can achieve near-zero input costs through strategic caching. The real cost optimization isn't just token pricing but minimizing the total tokens flowing through the loop through better tool design.

    • Are you hosting your own infrastructure for coding agents? At least from first glance, sharing actual codebase context across compacts / multiple tasks seems pretty hard to pull off with good cost-benefit unless you have vertical integration from the inference all the way to the coding agent harness.

      I'm saying this because the current external LLM providers like OpenAI tend to charge quite a bit for longer-term caching, plus the 0.1x cache-read cost multiplied by the number of LLM calls. So I doubt context sharing would actually be that beneficial: you won't need all the repeated context every time, so caching context means longer context for each agentic task, which might increase API costs by more overall than you save by caching.

[flagged]

  • > Instead of feeding 500 lines of tool output back into the next prompt

    Applies to everything with LLMs.

    Somewhere along the way, it seems like most people got the idea that "more text == better understanding", whereas reality seems to be the opposite: the fewer tokens you can give the LLM, with only the absolute essentials, the better.

    The trick is to find the balance, but the "more == better" assumption many users seem to operate under seems to be making things worse, not better.

  • > Too little and the agent loses coherence.

    Obviously you don't have to throw the data away: if the initial summary was missing some important detail, the agent can ask for additional information from a subthread/task/tool call.
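
    A minimal sketch of that pattern, with a made-up side store: keep the full tool output out of the context, hand the model a summary plus a handle, and let it fetch more only if it turns out to matter:

    ```python
    import uuid

    _FULL_OUTPUTS: dict[str, str] = {}  # hypothetical side store for full tool results

    def summarize_tool_output(output: str, keep_lines: int = 40) -> str:
        """Return a compact view plus a handle the agent can use to dig deeper."""
        handle = uuid.uuid4().hex[:8]
        _FULL_OUTPUTS[handle] = output
        lines = output.splitlines()
        return ("\n".join(lines[:keep_lines])
                + f"\n[{len(lines)} lines total; call fetch_detail('{handle}', query=...) for more]")

    def fetch_detail(handle: str, query: str) -> str:
        """Second tool: pull matching lines out of the stored full output on demand."""
        full = _FULL_OUTPUTS.get(handle, "")
        hits = [line for line in full.splitlines() if query.lower() in line.lower()]
        return "\n".join(hits[:100]) or "no matches"
    ```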