Comment by BeetleB
2 days ago
First: There's the obvious "If the company is letting me do it, I'll be wasteful." This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.
People have already mentioned the size/complexity of the codebase. I'm new to my team and the codebase isn't huge, but it's large enough that there are plenty of parts I have little understanding about. When I'm given a task, then yes, I definitely go to Claude and ask it to find the relevant parts of code so I can understand the existing workflow before even attempting to change it.
The downside is that I don't build expertise. But the reality is that with Claude, I can get the work done in 1 day that would take me 5 days of struggling, and if everyone is doing it, I can't afford to be left behind. So I take the middle route - I get it done in 2-3 days instead of 1 so I can at least spend some time with the code.
Especially with AI, the rate at which code changes in our codebase is insane. So I built a tool that takes a pull request and tells the LLM to go deep and explain what that pull request does. (Note: I'm not the reviewer, I just want to keep tabs on the work going on in the team).
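The tool is basically this shape - a sketch only, since my real version differs; the model id, prompt, and PR number are placeholders, and it assumes the GitHub CLI (`gh`) plus the anthropic Python SDK:

    import subprocess
    import anthropic

    def explain_pr(pr_number: int) -> str:
        # Pull the PR's diff via the GitHub CLI.
        diff = subprocess.run(
            ["gh", "pr", "diff", str(pr_number)],
            capture_output=True, text=True, check=True,
        ).stdout
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
        response = client.messages.create(
            model="claude-opus-4-1",  # placeholder model id
            max_tokens=2000,
            messages=[{
                "role": "user",
                "content": "Go deep and explain what this pull request "
                           "does and why:\n\n" + diff,
            }],
        )
        return response.content[0].text

    print(explain_pr(1234))  # hypothetical PR number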
And this is just the beginning. I haven't actually spent time to come up with more ways to use the LLM to help me.
My usage is similar to yours, but if I were fairly experienced with the code base, I'd do a lot more. I haven't asked, but I suspect there are people in my team who go over $1K/month.
As always, the bottleneck is proper testing and reviews.
Edit: I'll also add that for not-so-important internal code, I suspect most people are going full-AI with it. For my personal (non-work) code, I just let the AI write it all - the risk is usually very low (and problems are caught quickly). If someone is using the "superpowers" skill, then even basic features can burn lots of tokens. I usually start a task at 20-40K tokens of context and end at 80-90K when it's finished, which means many of the later requests were each sending in close to 80K tokens. Multiply that by the number of queries, etc.
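To put rough numbers on that (the request count is a guess; the token figures are the ones above):

    # Context grows roughly linearly from ~30K to ~85K tokens across the
    # session; every request re-sends the whole thing.
    start, end = 30_000, 85_000  # midpoints of my 20-40K and 80-90K figures
    n = 25                       # hypothetical number of requests
    total = sum(start + (end - start) * i / (n - 1) for i in range(n))
    print(f"~{total / 1e6:.1f}M input tokens")  # ~1.4M for 25 requests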
Wasteful, but if someone else is paying ...
> This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.
I see this repeated by others, including coworkers. It completely ignores caching. Caching itself is complicated, but the claim that "longer context window = more expensive" is not 100% true, and you are hampering yourself if you're not taking full advantage of large context windows.
You still pay for cache hits and refreshes, but the cost is lower.
The default Claude cache expires in 5 minutes. If you take a short break to review the code, talk to someone, or do anything other than continuously interact with the session, it gets evicted and you start over.
You can opt in to a 1-hour cache at a higher rate https://platform.claude.com/docs/en/build-with-claude/prompt...
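Roughly what opting in looks like with the anthropic Python SDK. The beta header and "ttl" field here are taken from the docs linked above, so treat the details (and the model id) as things to verify rather than gospel:

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-1",  # placeholder model id
        max_tokens=1024,
        # Beta opt-in for the extended cache TTL, per the linked docs.
        extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
        system=[{
            "type": "text",
            "text": open("codebase_context.txt").read(),  # big, stable prefix
            # 1-hour TTL: pricier cache writes, but survives short breaks.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        messages=[{"role": "user", "content": "Summarize the auth flow."}],
    )
    print(response.usage)  # reports cache-write and cache-read token counts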
Also anecdotally, caching has just been broken at times for me. I've had active conversations where turns less than 5 minutes apart were consuming so much quota that I doubt anything was being billed at the cache rate.
If you look at the actual cost of your Claude Code conversations, you'll see that it is overwhelmingly dominated by the cost of (cached) input tokens. Because of how we construct persistent conversations, each cached input token incurs cost on every API request, meaning that component of cost scales as O(request count). If you graph the cost curve of a Claude Code session, it's very obvious that this scaling factor overwhelms the cache discount.
Here is a blog post that shows some data - https://blog.exe.dev/expensively-quadratic. And I can confirm this is true for Claude Code - I set up a MITM capture for all Claude Code requests and graphed it.
So increasing the number of requests that reuse the same prefix (which is what higher compaction thresholds do) really does lead to substantially higher API costs.
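A toy model makes the scaling obvious (the rate is illustrative, assuming a cache hit costs 10% of a $15/MTok input price, and ignoring cache writes for simplicity):

    CACHED = 1.5 / 1e6  # $/token at an assumed cache-hit rate (illustrative)

    def session_cost(turns: int, growth_per_turn: int) -> float:
        cost, context = 0.0, 0
        for _ in range(turns):
            context += growth_per_turn  # context grows linearly per turn...
            cost += context * CACHED    # ...so total cost grows quadratically
        return cost

    for t in (10, 20, 40):
        print(t, round(session_cost(t, 5_000), 2))
    # ~$0.41, ~$1.57, ~$6.15: doubling the turn count roughly quadruples
    # the cost, even with every token billed at the cache rate.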
Caching is pretty simple. If it's a prefix match, it's cacheable. Very long context windows will be much more expensive than shorter ones, even with caching, assuming you're using Claude Code or some similar harness for both. You'll get caching in both, but you'll pay more for the longer context. The cost of occasional compaction is more or less negligible compared to the massive cost of the input tokens that are getting charged repeatedly for every single request.
If you have 500k context, three turns will burn ~1.5M tokens. If you have 250k context, three turns will burn ~750k tokens. If you have 125k context, three turns will burn ~375k tokens. Claude can generate at most 32k output tokens per turn in Claude Code (and it rarely does), so despite the higher price of output tokens, almost all cost is dominated by input tokens. Even at cached input prices, cost scales near-linearly with context length: 2x your context length and you roughly 2x your cost.
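Back-of-envelope, assuming a $1.50/MTok cache-hit rate (10% of a hypothetical $15/MTok uncached input price; plug in your model's real pricing):

    CACHE_HIT = 1.50  # $/MTok at the cache-hit rate (assumed)

    for context in (125_000, 250_000, 500_000):
        tokens = 3 * context  # three turns, each re-sending the whole prefix
        print(f"{context // 1000}k context: {tokens / 1e6:.2f}M tokens, "
              f"~${tokens / 1e6 * CACHE_HIT:.2f}")
    # 125k -> 0.38M (~$0.56); 250k -> 0.75M (~$1.12); 500k -> 1.50M ($2.25)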
Now, it might be the case that longer context windows let Claude complete the task better — although I'd be surprised if there were many tasks requiring >200k tokens just to get the job done (that's nearly ten full copies of Shakespeare's "A Midsummer Night's Dream"). And they're definitely convenient, in the sense that you don't need to think about context management as much, or worry about a sudden, unexpected autocompact wrecking things because you weren't carefully compacting at logical points. But they're more expensive on a near-linear basis, and you're paying for that convenience.
It’s crazy that people don’t understand cached tokens despite them being priced separately on the cost pages of every single provider.
It's crazy that people think caching is such a silver bullet, despite the cost of long context windows still being ridiculously high even with caching. https://news.ycombinator.com/item?id=47000034
> It’s crazy that people don’t understand cached tokens despite them being priced separately on the cost pages of every single provider.
Depends on your subscription type. Some are just a flat monthly fee.
> But the reality is that with Claude, I can get the work done in 1 day that would take me 5 days of struggling,
Is it really a 5x ROI? Where are all the apps, games, platforms, SaaS products, and features that have been backlogged for 5 years that are suddenly getting done? Because I see a modest ROI, and an _awful lot_ of shovelware.
> I'm new to my team and the codebase isn't huge, but it's large enough that there are plenty of parts I have little understanding about.
When you're new to the codebase, things that take an experienced colleague one day to do can take a newbie 3-5 days to do.
> First: There's the obvious "If the company is letting me do it, I'll be wasteful." This includes not clearing/compacting the context often. Opus now has a 1M context window, and quality is good to at least 200K. So each query is burning a lot of tokens until you clear/compact.
What is wasteful? If you cost the organization $x/hr and spend an hour saving the company $0.5x, you didn't save money; you wasted it.
From the company's point of view: is the time you spend being token-efficient worth more than the money it saves? That's not even getting into opportunity costs.
There is some extremely wasteful spending of AI tokens out there. But trying to get below $3k/month in token costs is often of questionable value.
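Concretely, with made-up numbers:

    hourly_cost = 100    # $/hr fully-loaded engineer cost (assumed)
    hours_spent = 5      # hrs/month spent on manual context hygiene (assumed)
    dollars_saved = 300  # $/month shaved off the token bill (assumed)

    net = dollars_saved - hourly_cost * hours_spent
    print(f"net: ${net}/month")  # -$200: the frugality cost the company money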