I have a multi-agent (MA) system set up for personal use.
You give it a problem, then refine it: a fast, cheaper model asks you questions, and your answers produce a better input prompt. You then choose an MA strategy. For example, break the problem into sections and have a final judge conclude, or run a multi-turn debate between agents and have a judge summarise the debate.
The best approach is what I call 'all angles', where all these strategies run in parallel and a final meta-judge synthesises the response. The most useful part, which I recently added, is a view showing the variance in each strategy.
Been using this for life stuff - housing search, schools, family challenges!
Perhaps I should make a video of it in action. If people in the HN community are interested, let me know.
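For anyone trying to picture the 'all angles' flow, here is a minimal sketch. It is not the commenter's harness: the strategy functions are hypothetical stubs that return fixed numeric scores instead of calling a model, the names `decompose_then_judge`, `debate_then_summarise`, `all_angles` and `meta_judge` are made up for illustration, and "variance" is assumed to mean the spread of agent scores within each strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pvariance
from typing import Callable, Dict, List

# Hypothetical stand-in for real LLM calls: a strategy maps a problem
# statement to one numeric score per agent (say, a 0-10 rating of an option).
Strategy = Callable[[str], List[float]]


def decompose_then_judge(problem: str) -> List[float]:
    # Stub: pretend three section agents rated the option fairly consistently.
    return [7.0, 8.0, 6.0]


def debate_then_summarise(problem: str) -> List[float]:
    # Stub: pretend the debating agents disagreed much more.
    return [3.0, 9.0, 6.0]


STRATEGIES: Dict[str, Strategy] = {
    "decompose": decompose_then_judge,
    "debate": debate_then_summarise,
}


def all_angles(problem: str, strategies: Dict[str, Strategy]) -> Dict[str, Dict[str, float]]:
    """Run every strategy in parallel; report mean and variance per strategy."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, problem) for name, fn in strategies.items()}
        scores = {name: future.result() for name, future in futures.items()}
    return {
        name: {"mean": mean(s), "variance": pvariance(s)}
        for name, s in scores.items()
    }


def meta_judge(report: Dict[str, Dict[str, float]]) -> float:
    """Toy meta-judge: average the per-strategy means."""
    return mean(r["mean"] for r in report.values())


report = all_angles("Which of these two schools fits better?", STRATEGIES)
for name, stats in report.items():
    print(name, stats)
print("meta-judge score:", meta_judge(report))
```

In a real setup the stubs would be model calls and the meta-judge would itself be a model; the point of the variance column is only that a strategy whose agents disagree widely deserves less trust than its mean suggests.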
Right, here is the video demo of what I built - https://streamable.com/e49cgt
You mention cost in one of the replies. Can you elaborate on the cost profile (ballpark) for various problem types? I would also be curious to understand the strategies employed and what the costs look like across each.
Definitely interested, would love to see a video :)
Sure, let me do that. Can I post this as a Show HN if it's just a video? The rules say people need to be able to try it out, but that would cost me a small fortune :) ... I could perhaps post it on GitHub and people can set up the repo themselves with their own OpenRouter key, if that works. I have never done a Show HN, but it would be fun to try.
So what harness are you using? And what LLMs?
Homebrew harness, and all the frontier models plus DeepSeek, all via OpenRouter at the moment. It works well enough but can get expensive, so I use it for really high-value challenges. Interestingly, the refine feature has been the most useful to me and to the people I have shown it to. Essentially, people are lazy when expressing the initial problem (me included!); refine asks relevant questions about the initial problem and then refines the initial statement, which the user can accept, reject, or edit before submitting.
One month I could use GitHub Copilot fully with no disruptions. The next month, after pricing changes, I’ve run out of tokens in two days.
Such drastic changes tell me that the pricing of tokens is arbitrary, and that the AI business is running out of money fast.
I think it's more a consequence of pushing for the biggest valuation/IPO. Rumoured profits on inference are north of 70%.
Taking SpaceX as an example, they have increased prices across all their consumer products over the past six months. But they definitely aren't short on money with Alphabet and Anthropic combined paying them over $2 billion per month.
Microsoft/GitHub lost out here as they were just repackaging other people's products.
Inference can only happen after having invested in training and datacenter construction. Arguing about "inference profitability" sounds a lot to me like ignoring large cost centers of these companies.
> Rumoured profits on inference are north of 70%.
Rumors are worth squat when they’re most likely put in motion by the people with a vested interest in this industry.
Let’s talk about profits when there’s real data from the IPO documentation.
SpaceX is increasing prices because they're trying really hard to get into the S&P 500.
The GitHub example is also a bit of an outlier because they made a recent change to their pricing, which is why it's such a drastic jump.
Also, prices in general, for all things, are based on underlying factors; that doesn't make them arbitrary (i.e. GitHub executives using a random number generator for token pricing would be arbitrary).
I wrote a Substack post on this topic back in December https://open.substack.com/pub/zacharywhitley/p/the-coming-ag...
> Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%
I'm seeing a ratio of around 10:1 in my usage. A vast majority of the tokens consumed are on the input side. The agent will often read a million tokens just to patch one line of code.
I think if you are seeing something closer to 1:1 or more on the output side, there is either a problem with the agent or the codebase is new / empty.
If input tokens dominate the cost to that extent, this implies that major gains are possible by making better use of caching. You could basically ask the model to do a one-time "compaction" step including a dump of the relevant portions of the code, and use that as the cached prefix for a large amount of "swarm" subagent calls.
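To put rough numbers on the cached-prefix idea, here is a back-of-the-envelope sketch. Every figure in it is hypothetical: the $3 per million input tokens, the assumption that cached reads cost 10% of the normal input price, and the prefix and task sizes are chosen only for illustration. It also ignores cache-write surcharges and cache expiry, which some providers apply.

```python
def swarm_input_cost(prefix_tokens: int, task_tokens: int, n_calls: int,
                     price_per_mtok: float, cached_read_factor: float):
    """Return (uncached, cached) input cost in dollars for n_calls subagent calls.

    Assumes every call sends the same prefix followed by its own task tokens,
    and that only the first call pays full price for the prefix.
    """
    per_token = price_per_mtok / 1_000_000
    # Without caching, every call pays full price for prefix + task.
    uncached = n_calls * (prefix_tokens + task_tokens) * per_token
    # With caching, the first call pays full price ...
    first_call = (prefix_tokens + task_tokens) * per_token
    # ... and later calls pay a discounted rate on the shared prefix only.
    later_calls = (n_calls - 1) * (
        prefix_tokens * cached_read_factor + task_tokens
    ) * per_token
    return uncached, first_call + later_calls


# Hypothetical scenario: 200k-token code dump, 2k-token task, 50 subagents.
uncached, cached = swarm_input_cost(
    prefix_tokens=200_000, task_tokens=2_000, n_calls=50,
    price_per_mtok=3.0, cached_read_factor=0.1,
)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumed prices the input bill drops by roughly a factor of eight; the real saving depends entirely on the provider's cache pricing and on how much of each call is actually shared prefix.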
Did you experiment with giving the agent better tools to navigate and document the codebase? ASTs, language servers and so on?
A million tokens (not cached) sounds like a lot.
The target codebase is very large. A million tokens is a drop in the proverbial bucket.
I still don't understand how caching helps me very much. I must be misunderstanding it, because I thought the user's prompt (which is the biggest variable) necessarily sits prior to all of these token-intensive tool calls. How can we cache the reading of the codebase if the prefix is always moving?
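One way to see where caching can apply inside a single agent session: on each turn the agent re-sends the whole conversation so far, so the previous request is an exact prefix of the next one. The sketch below is an idealized model with hypothetical message strings; it counts matching leading messages rather than tokens, and real providers add their own rules (minimum lengths, expiry, explicit cache markers).

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical session: the user prompt comes first, tool results pile up after.
history = ["system: you are a coding agent", "user: fix the off-by-one in parser.py"]
requests = []
for tool_result in ["tool: contents of parser.py",
                    "tool: contents of lexer.py",
                    "tool: test output"]:
    requests.append(list(history))  # what is sent on this turn
    history.append(tool_result)     # the result is appended for the next turn
requests.append(list(history))

for prev, cur in zip(requests, requests[1:]):
    reused = shared_prefix_len(prev, cur)
    print(f"{reused} of {len(cur)} messages match the previous request")

# Across sessions the picture changes: a different first message
# means nothing after it can match.
print(shared_prefix_len(["user: task A", "code dump"], ["user: task B", "code dump"]))
```

So within one task the expensive file reads become part of the reusable prefix on the following turn; what a varying user prompt breaks is reuse between separate sessions, unless the stable material (system prompt, code dump) is placed ahead of it.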
One thing I've noticed using agents for coding is that they really like to write thousands of unit tests but not dynamically test.
And they like to burn a ton of tokens writing and debugging tests that are semantically corrupt.
Unit tests are a type of dynamic testing. As opposed to static testing which is linting/typechecking etc.
If you want a different kind of dynamic testing besides unit tests, have you tried writing it in as a requirement during the planning/PRD phase?
And AWS heavily pushes a complex Lambda solution, stringing together as many chargeable AWS services as possible for a simple requirement.
Their interests are often not your interests. In this case they want you to spend unnecessary money on useless work (let's stop the euphemism of "tokens" btw).
These kinds of cute conspiracy theories don’t actually hold true in real life. The companies want to make useful products.
You can just tell them to do more dynamic testing. I think dynamic testing is partly frowned upon because it slows things down and can take down software where you wouldn't expect.
Reminded me of this paper from last year that tries to make token usage more efficient by providing budget guidance information. [1]
[1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Stee...
It’s just like airline reward miles, and offers no benefit to companies over just renting bare-metal GPU time.
I hope this horrible time will soon be over, once cheaper NPUs become available from more hardware companies and model sizes get optimized down further.
I wonder what hyperscaled compute farms and models will be good for at that running cost when most AI needs can be fulfilled by on-prem and on-device hardware and models. Probably the only customers left will be big governments. So in the end the taxpayer has to pay for those billions of investments by the AI cartel.
The typical NPU is only marginally helpful for on-prem inference. A GPU can read quantized data from main memory and dequantize/pad it locally (making effective use of memory throughput); an NPU often needs to read padded data directly from memory, which is wasteful. So it only helps a little bit wrt. prefill.
Also, smaller models can obviously be used, but a smaller model will be a lot weaker in real-world knowledge, and this tends to limit its smarts in a way that can't be compensated for by more thinking.
Amusing side note:
Was in a meeting reviewing a potential new product. It was going well until they showed us that they had added AI to it (of course they had). It was pretty obviously just shoehorned in, and one part of that obviousness was a column showing how many tokens each query took.
I asked who is paying for the tokens; they said it's included in the license. I said, so is there a budget, or is it all-you-can-eat? They said good question, they didn't know and would get back to me. I said the reason I asked was that just one query there had a 250k token burn on it, and it was a fairly simple query about one device.
Then one of the execs on their side was heard saying out loud, "Why are we even showing this to the customers?"
It gave us quite a chuckle. But lesson learned... the cost of adding AI to anything isn't really being accounted for, let alone the true cost of actually running the AI.
All things AI are going to get more expensive, even if you don't want the AI aspect.
AIshittification
First thought was "only 30 tasks". However, the findings map to what I've seen personally: code review consumes the majority of tokens.
Code review could also be run as an unattended/batched task though, possibly with at least some use of on-prem inference (which excels at this). That would be a major saving compared to the usual cloud inference scenario.
with which models, though?
In the past Google et al would hire engineers based on how well they could optimize the infrastructure.
Maybe soon companies will look at how engineers can optimize the token efficiency of AI.
That assumes tokens will remain a meaningful expense. I’m not sure developers will find uses for ever more tokens nearly as quickly as the prices fall.
How are we so confident that prices will fall? Isn't the exact opposite happening, right now, during arguably the most critical part of this whole saga (pre-IPO to make things appear as beautiful and as not-obviously-illegal as possible)? And the only reason they were "falling" previously was for hyper growth.
I know how to drop a company’s token costs to zero: treat tokens as a utility same as internet and make engineers pay for it.
I would easily pay a lot of money to have access to AI for my job. I actually do pay. If the cost were significant, I'd just add it to the hourly rate that I consider acceptable. The company always pays in the end, because the company is the only entity with money in this setup.
Tokenomics is already a word used to describe cryptocurrency economics, not sure why they'd try to redefine it for AI even if a different sort of token is used.
Tokenomics had already been used by marijuana enthusiasts for a long time.
cryptocurrency economics = cryptonomics
You're welcome! =)
Neal Stephenson wrote a book about it.
New fad. Forget about the old fad. This one will be old soon, you better get on board before it's too late!
Crypto was already a term before cryptocurrencies made it about them. Web 3.0 was already a thing before crypto bros made web 3 about cryptocurrencies.
So what? Terms are reused in different contexts all the time. And most people have moved on from cryptocurrencies anyway, so there’s little chance it’ll confuse anyone.
In its current iteration the AI tech market is not economically sustainable: not for the other markets outside the AI economy, and, most deadly, not even for the main target customers or the AI tech companies themselves. There have been several news reports of companies overspending their token budget month after month. The hardware monopolist and his network of buddy companies can set the token price as freely as they want. There are no competitors; their only "competitor" is people ceasing to use AI altogether.
I don't think business is interested in the sustainability of anything. There are zero incentives for that for anyone.