Comment by harimau777
1 day ago
The comments I see recommending selective use of cheaper models don't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelihood by using a less effective model.
Saving money on tokens isn't something that's rewarded during performance reviews; particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
I think quantifying tokens used is analogous to quantifying the amount of sawdust generated on a construction site.
Churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game theoretical dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
To understand the token count thing - spending tokens is necessary but not sufficient to demonstrate that you are adopting AI.
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
So it’s a pragmatic metric that got a bit Goodharted.
No sawdust is bad. But it's also bad if you cut all your boards into sawdust. Completely. Obliterated. No useful output, only sawdust.
% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?
Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality it may have churned on 50% of the lines it generated and auto-accepted first.
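To make the point above concrete, here is a minimal sketch of how an accept rate can read 100% while half the generated lines are later rewritten. The `Suggestion` record, its field names, and the session numbers are all hypothetical, made up for illustration; no real tool is claimed to report data in this shape.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    lines_generated: int  # lines the model proposed
    accepted: bool        # True when auto-accept (or the user) took it as-is
    lines_rewritten: int  # generated lines later replaced by follow-up edits


def accept_rate(suggestions):
    """Share of suggestions that were accepted, regardless of later churn."""
    if not suggestions:
        return 0.0
    return sum(s.accepted for s in suggestions) / len(suggestions)


def survival_rate(suggestions):
    """Share of accepted lines that were NOT rewritten afterwards."""
    generated = sum(s.lines_generated for s in suggestions if s.accepted)
    rewritten = sum(s.lines_rewritten for s in suggestions if s.accepted)
    if generated == 0:
        return 0.0
    return (generated - rewritten) / generated


# Hypothetical session with auto-accept on: everything is "accepted",
# then follow-up prompts rewrite 100 of the 200 generated lines.
session = [
    Suggestion(lines_generated=100, accepted=True, lines_rewritten=60),
    Suggestion(lines_generated=60, accepted=True, lines_rewritten=20),
    Suggestion(lines_generated=40, accepted=True, lines_rewritten=20),
]

print(accept_rate(session))    # 1.0
print(survival_rate(session))  # 0.5
```

With auto-accept on, the first number says nothing about quality; the second is closer to what the comment calls churn.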
My feeling is it's not as bad of a metric as people think. Companies don't fully know the best way to use AI and things are changing rapidly, so you want people using a lot of tokens even on stuff that seems maybe kind of dumb on the surface, because if you find one useful thing and share it in the org that makes up for a lot of failures.
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
> % of AI suggestions accepted vs edited
this has to be the worst metric.
anytime the llm wants me to read a diff of one file, im just gonna send it forward so i can read the whole diff
It is not a pragmatic metric in any way. And they are not evaluating whether it is useful by maximizing tokens spent. To do that you need to, well, evaluate various kinds of use.
It is oddly unprecedented economic behavior.
That sawdust analogy is fantastic!
We may be on the cusp of the AI age's new era of 'measure twice, cut once'.
Suddenly, LoC returned
With the rise of agentic coding, this has become a sign of quality for me in my own PRs and reviews: New features implemented in less than a thousand lines of productive code.
When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
Slowly and then suddenly :)
""" Steve Ballmer In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand line of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 5OK-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS 2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing. """
From https://www.pbs.org/nerds/part2.html
> I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough.
And the tragedy is that this isn't sustainable, and we all involved deeply in tech know this. There is eventually going to be a big reality check the companies will have to pay, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us at least for now and for the foreseeable future. However when the rope eventually snaps these executives at best will fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones that will suffer through the next big layoffs.
Unfortunately, I think this is correct. Such as it ever has been with technological change. The folks at the bottom bear the brunt of the dislocation and the folks at the top pat themselves on the back for being so forward looking and get huge payouts regardless of the actual results. Further, the folks at the top are always incentivized to go along with the herd of their peers because if it works then they were on the bandwagon, and if it doesn’t work, well then, how could they have known because “Everyone was deceived.”
> because if it works then they were on the bandwagon, and if it doesn’t work, well then, how could they have known because “Everyone was deceived.”
They call themselves "risk takers" to justify their high pay.
Most companies do not care about quality. _users_ who have to interact with that software will pay the price.
Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some MCP servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
How much creativity do you need to fix bugs in corporate code? Almost zero. It’s maintenance, not creative work. Nothing against it, it’s needed, but let’s be real, would anybody be really sad if this work is overtaken by LLMs? I certainly won’t be, let them do it.
> How much creativity do you need to fix bugs in corporate code? Almost zero.
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I'd not be sad about the job itself, but about the dev who had a mortgage to pay and has now been replaced by a machine churning out crap code while their superiors get sore from patting themselves on the back.
IBM System/360 OS had more than 50,000 bugs which could not be fixed because fixing any single bug would introduce two new bugs. I fear that a lot of AI software systems will reach the same crapware state as IBM System/360 very, very soon!
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
Anyone (including ANTHROP\C) "recommending selective use of cheaper models" is spending costly human time (which costs more over time) on correcting the machine (which costs less over time). This is a bad trade.
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about, which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
I disagree heartily with everything here, both in personal experience from the models, and in values about coding.
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular, Claude models older than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the worse the outcome. The more the model is clever rather than clear and simple, the worse the outcome. The more the model turns into an architecture astronaut, the worse the outcome.
I’d point out that smaller and simpler also makes their router code easier to review and that fewer lines will have fewer bugs (on average) and those bugs will be more obvious. But then, I’m old school and won’t let an AI work on code without reviewing it, and I mostly write code by hand.
Yes, cost per line of code itself is an error.
Only cost for effective* outcome matters. And if your lines of code have a cost, you would want fewer lines of code to achieve the outcome, not more.
Are you sure you disagree with that?
* If your place of work starts talking "efficiency"**, run. Find somewhere the conversation is *effectiveness* — at the goal/outcome level.
** Not to mention that "efficiencies" is MBA speak for "right sizing" away effectiveness.
Too many people see wages as a sunk cost and a constant. One problem though is AI costs per task are unpredictable, and management tends to prefer predictable outcomes over optimal outcomes.
> The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
> I haven't seen "just absorb a giant ball of context and do the right thing the first time"
Exactly.
No more than sitting down and writing code before a product concept or spec or architecture comes out right the first time, or fifth.
Absorb the concept, make a shape of outcome, then a spec, then hold its hand to architect a series of iterations, either component by component or thin vertical slice or whatever combination lets you iterate in working increments...
Your brain, machine leverage. After all, it types faster than you. But it should type what you want.
You know what it should type, right? If you don't, you're gonna have a bad time anyway.
If you have such a toxic environment, run.
If you’re sitting under a tree in the rain and it gets soaked through and you start getting wet, finding another tree won’t help you.
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
I don’t disagree that the expectations are higher, but token output hardly correlates to code output worthy of merging.
*the whole industry in countries without strong worker rights
It’s too bad that, yet again, instead of the productivity gains leading to shorter work weeks, the benefits accrue to the companies. Just once I’d like to see productivity gains lead to more leisure time, not higher expectations.
Maybe once we get universal income we can start recommending this. Until then the individual isn't to blame when the only option to keep providing is to keep grinding in a toxic environment.
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
Where to, that's the question. The economy is in the gutters and the replace-people-with-AI craze is making the issue even worse.
Perhaps for now. But you know, after working solid with AI for two years and adopting effective methods using detailed plans, and having a lot of success with it, here is the problem:
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
Now as you can see from the article, it starts turning. People are getting less pricey than agents on API pricing.
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
> The economy is in the gutters
Consumer sentiment is in the gutters, certainly. But objective measures of the economy like unemployment and real wages look good to excellent:
https://fred.stlouisfed.org/series/UNRATE
https://fred.stlouisfed.org/series/LES1252881600Q
And open positions are simply because someone decided to run from that place
> I don't churn out code quickly enough
Curious what industry that is.
This, I happily used the opus 4.6 fast mode to the tune of 5k for a project. The delivery of the project justified the 5k, if I only spent 500 but delivered the project 1 month later - I would have been in the dog house.
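The trade-off described above can be written out as back-of-the-envelope arithmetic. This is only a sketch: the $5k and $500 token figures and the one-month delay come from the comment, while the one-month baseline duration and the $15,000 monthly engineer cost are assumptions invented for illustration.

```python
def total_cost(token_spend, months, monthly_engineer_cost):
    """Token spend plus the cost of the engineer's time for the project."""
    return token_spend + months * monthly_engineer_cost


# Hypothetical fully loaded cost of one engineer-month (an assumption,
# not a figure from the thread).
MONTHLY_ENGINEER_COST = 15_000

# Fast mode: $5k in tokens, assumed to ship in 1 month.
fast = total_cost(5_000, 1, MONTHLY_ENGINEER_COST)

# Cheaper mode: $500 in tokens, shipping 1 month later (2 months total).
slow = total_cost(500, 2, MONTHLY_ENGINEER_COST)

# The extra token spend pays for itself whenever one month of the
# engineer's time is worth more than this amount.
break_even = (5_000 - 500) / (2 - 1)

print(fast)        # 20000
print(slow)        # 30500
print(break_even)  # 4500.0
```

Under these assumptions the expensive model wins unless an engineer-month costs less than $4,500, and that is before counting the value of shipping a month earlier.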
Your project cost $5k in tokens? How does that work? over what time? My understanding is that most developers are given pro max plans at $200/m and are expected to max that out.
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
Maxed out the 200 dollar per month plan then incurred the overage. This was over the span of 1 month or so.
My real comment is, why were they not just using their self-hosted copies of it? Do they pay back Anthropic for use of it in Azure? Broker a deal, let Anthropic charge you drastically less to use their model AND Anthropic could have made Claude Code work directly with Azure for Microsoft employees. Pennies on the dollar, and Microsoft could do it using low-use GPUs to save on cost, or stack underused GPU compute (this is how serverless was born btw - it's the unused resources in a web server somewhere).
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
> The comments I see recommending selective use of cheaper models doesn't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelyhood by using a less effective model.
I don't buy it. Old models such as GPT4.1 were faster than newer reasoning models, and their output was as good. Newer models end up wasting an ungodly amount of time with chain-of-thought steps which can be a complete waste of time if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.
Are you saying you found GPT 5.5 to be as good as 4.1 for coding?
This, if you’re high performing, the company won’t question your use of tokens. If they want to limit it, they have ways to set limits on spend and usage.