Comment by lumost
1 day ago
I actually can’t wait for the future where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription.
There are many problems I want to work on which require billions of tokens. These are completely inaccessible without corporate project sponsorship at the moment. An ASIC generation machine which can pump out a few tens of thousands of tokens per second at Opus 4.6 quality is more than sufficient.
A company called Taalas is working on something like that. Not Opus 4.6 quality, but I'm sure they're targeting larger models. Currently they're using a Llama 8B model. It runs at ~17k tokens per second, and you can test it at https://chatjimmy.ai/.
I'm rooting for them HARD but they've been quiet since their last (and only) blog. X and LinkedIn are empty too. I really hope it wasn't a pipe dream.
It starts to get interesting when latency is better than the average website.
I’m not sure if this is what you meant, but at 17k t/s, you start to compete with the speed of network calls. You could approach the point of generating an HTML/js/css page faster than some websites can be returned over the network.
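A rough sketch of that comparison, assuming generation time is just page size in tokens divided by throughput. The 17,000 t/s figure is the one quoted above; the page size and the 200 ms network budget are made-up illustrative values, not measurements.

```python
def generation_time_ms(page_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a page, ignoring prompt processing and rendering."""
    return page_tokens / tokens_per_second * 1000.0


TOKENS_PER_SECOND = 17_000   # throughput quoted above
PAGE_TOKENS = 3_000          # assumed size of a small HTML/JS/CSS page
NETWORK_BUDGET_MS = 200.0    # assumed time for a slow site to respond

gen_ms = generation_time_ms(PAGE_TOKENS, TOKENS_PER_SECOND)
print(f"generate: {gen_ms:.0f} ms, network budget: {NETWORK_BUDGET_MS:.0f} ms")
```

Under those assumptions a small page generates in well under the network budget; a page several times larger would not.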
The immediate load (less than 200ms on my machine through a slow connection) is quite pleasant, tbh.
That's cool, I just tested it out and it is fast but unfortunately its accuracy is not great.
It's an 8B model. Consider it a proof-of-concept.
Round-robin the free tier APIs, should be effectively free. Just say “sike” if discussing sensitive issues so the LLM never flags you.
I'm curious how hardware and power cost would stack up to subscription cost
Right now there are some heavily subsidized subscriptions that are more or less cheating. For instance, GitHub Copilot at $39/month gives you Claude Opus 4.6. They're going to close that off, but right now it's like a freebie for those doing API agentic harnesses.
That said, if you are running always-on agents and you spend $3k-$4k on a GB10, or $5k+ on Apple Silicon, as your sunk cost, you will probably come out ahead.
I've got 5 agents running a purely experimental social experiment. They operate in an Evennia MUD (a familiar-sounding city called "gothmud"). I've built a channel, idle prompts, a sleep schedule. I feed in real-world news and weather. There's a character up in a clock tower that reads Evennia's audit logs every 20 minutes to surveil the city, and a cast of people wandering around, investigating things, having coffee, repairing robots. This is all hitting qwen3.6-35-A3B on the Asus GB10, which cost me $3k.
Over the last 30 days, I've hit 394M input tokens and 1.6B output tokens. I would have spent between $1,600 and $1,700 if I were using OpenRouter. Not counted in that: I also have ComfyUI running in the spare space, and the agents "take photos" of the rooms they're in, selfies, workshop photos, etc.
How much did I spend on electricity? I don't have a meter on my box. My total electric bill for the last 30 days was $220, so I know it's less than that. My rate to compare is 11.7c/kWh, but it's closer to 15c/kWh total. The Asus GX10 has a 240W power supply, and it's probably only pulling 180W. I estimate $15-$20/month. Worst case, red-lining: 240 watts for 720 hours is about 172 kWh, and at $0.20/kWh I come to $35.
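The electricity estimate can be checked with a few lines of arithmetic. This is only a sketch using the wattages and rates quoted above. The payback figure at the end is an added assumption: it compares the $3k hardware price against the low end of the OpenRouter estimate and ignores everything except electricity.

```python
HOURS_PER_MONTH = 720  # 30 days * 24 hours, as in the comment above


def monthly_cost_usd(watts: float, usd_per_kwh: float,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Energy cost of a constant load: W * h / 1000 = kWh, times the rate."""
    return watts * hours / 1000.0 * usd_per_kwh


typical_low = monthly_cost_usd(180, 0.117)   # estimated draw, base rate
typical_high = monthly_cost_usd(180, 0.15)   # estimated draw, all-in rate
worst_case = monthly_cost_usd(240, 0.20)     # full PSU rating, padded rate

# Months for the $3k box to pay for itself against the low end of the
# quoted OpenRouter estimate, counting only electricity as running cost.
HARDWARE_USD = 3_000
API_USD_PER_MONTH = 1_600
payback_months = HARDWARE_USD / (API_USD_PER_MONTH - worst_case)

print(f"typical: ${typical_low:.2f} to ${typical_high:.2f} per month")
print(f"worst case: ${worst_case:.2f} per month")
print(f"payback: {payback_months:.1f} months")
```

That lands at roughly $15 to $19 a month for the typical case and about $35 worst case, so the figures in the comment hold up.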
Here's the kicker: that GitHub Copilot subscription I mentioned? I have another agent running on that, reading all my other agent logs, managing my Obsidian notes, doing research, sending briefings. And all by itself, it used almost the same amount of Claude Opus tokens for that $39/month subscription. I was actually a bit shocked when I pulled a recent report and saw that. I'm working to migrate functionality away from the Copilot subscription to the local model. A lot of the initial setup might have needed it, but not the ongoing review-style work it does.
Have you learned anything interesting from your agent ant farm?
> experiment
What is the experiment? What are you hoping to learn from all this?
Or do you just mean you've made a dynamic dollhouse that you think is cool? The Sims on your own terms?
For open models, usually not well. You get 5+ providers competing on cost, all with cheaper electricity and better hardware utilization than your local setup
I did an estimate of that if you're interested: https://x.com/pwnies/status/2028831699736637912
The TL;DR though is that a 10-15b param model baked into an ASIC with the latest fab tech would take around 62W of power draw when active. At ~10k+ t/s though it likely would only be active for short bursts of time. It'd fit perfectly fine within the thermal envelope of a laptop.
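To put the "short bursts" point in numbers, here is a back-of-the-envelope sketch. The 62 W and 10k t/s figures are the estimates quoted above. The response length and battery capacity are illustrative assumptions, and idle power is ignored entirely.

```python
ACTIVE_WATTS = 62.0          # estimated draw while generating (quoted above)
TOKENS_PER_SECOND = 10_000   # lower bound of the quoted "~10k+ t/s"
RESPONSE_TOKENS = 2_000      # assumed length of one response
BATTERY_WH = 60.0            # assumed laptop battery capacity

# Energy per token is power divided by throughput (J/s over tokens/s).
joules_per_token = ACTIVE_WATTS / TOKENS_PER_SECOND

# The chip is only active for as long as the response takes to generate.
active_seconds = RESPONSE_TOKENS / TOKENS_PER_SECOND
joules_per_response = ACTIVE_WATTS * active_seconds

# 1 Wh = 3600 J, so this is how many such responses one charge could cover
# if generation were the only load on the battery.
responses_per_charge = BATTERY_WH * 3600.0 / joules_per_response

print(f"{joules_per_token * 1000:.1f} mJ/token, "
      f"{active_seconds * 1000:.0f} ms and {joules_per_response:.1f} J per response")
print(f"about {responses_per_charge:,.0f} responses per charge")
```

With these inputs each response keeps the chip busy for a fraction of a second, which is why the average draw would sit far below the 62 W peak.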
The approach makes a lot of sense. Once you get to those speeds, latency of the network becomes one of the bigger bottlenecks, so local has a real advantage over a subscription.
You're not counting the capex which could be the same cost as 5-10 years of Claude.
Is latency of the network that noticeable? Aren’t we talking low hundreds of ms at worst here? Much lower for something close regionally.
Can you give an example of such a problem?
"Design me a 3D-printable rocket engine for a hobby rocket project. Verify its design in a full simulation. Iterate until it works reliably in simulation based on a verified printable design on a consumer laser sintering device (or substitute contract manufacture for under 1000 dollars)."
This is a hobby version of a project, but you can imagine commercial versions of the same prompt for new databases, genomics studies, material analysis, operating systems etc.
From the prompt it seems evident the envisioned user doesn't have an interest in designing the motor themselves, so why not simply buy a stock motor?
You almost certainly do not want an LLM to do that. Leap71 actually has computational models generating functional rocket engines that way. You could absolutely wrap a tool like that in a shell and handle control with an LLM and not need anywhere near the tokens.
That's the thing: these models see and predict tokens. For any real engineering you get more bang for your buck using math.
Verifying how the model works against the real world is the difficult, dangerous bit.
There might be some interesting side effects from making simulation software, which is currently either an expensive niche or quirky university project (SPICE always has that feel).
I’m not convinced at all that the model won’t just get stuck in a loop where it doesn’t understand how to fix the broken rocket. I see similar failure modes in far simpler projects strictly confined to coding. This feels closer to “make me a profitable business, make no mistakes” than to a simple coding project.
Are there already skills around modelling, simulation and post-processing? Any pointers?
Stop it, you tease. I'm getting a little tingly
Decompiling a binary and recreating the source, doing a full line-by-line security audit, always-on agents monitoring state minute-by-minute, etc.
I would very easily find ways to hit that level of token usage if it was cheaper/faster.
Not OP but if I had a couple RTX 6000 I'd throw them at decompiling bloodborne to play on PC without emulation.
OK, here's the thing: you will never truly be able to do this, due to logic.
Logically, five people pooling their resources beats one guy.
Therefore datacenters will always win, because they get higher time utilization.
so forget it.
I always wonder the same, but I let logic tell me it's a fantasy: on average you can't outspend a whole group of people making better use of the hardware.
You will get better hardware though; cutting edge will always be cloud.
Laptops/desktops are cheaper per flop than any datacenter hardware by a good order of magnitude.
The problem is that expectations rise in datacenters: hardware/power/security/availability guarantees cost real money. Then the operator providing these guarantees expects some margin.
You can see this most clearly with "developer desktops": a GCP instance costs about 10x a Hetzner instance, which costs between 5 and 10x the same hardware sitting in the back of an office somewhere. While all of these premiums matter for 24/7 systems under active development, they don't really matter for ephemeral small-scale workloads.
Doesn’t it flip around for small scale? Even paying 100x the cost for something, all in, it’s cheaper to rent for small workloads like 10m/day.
At 10x you have to be at hours per day, and at 5x you’re at 4h.
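The break-even point behind those numbers can be sketched with a very simple model. It assumes owned hardware costs the same per day whether or not it is used, and that rented hardware is billed only for the hours used at some multiple of the owned hourly cost. Both are simplifications, and the function name is just for illustration.

```python
def breakeven_hours_per_day(rent_premium: float) -> float:
    """Daily usage below which renting is cheaper than owning.

    Owning costs c * 24 per day regardless of use; renting costs
    rent_premium * c * hours_used. Setting them equal gives
    hours_used = 24 / rent_premium.
    """
    if rent_premium <= 0:
        raise ValueError("rent_premium must be positive")
    return 24.0 / rent_premium


for premium in (100, 10, 5):
    hours = breakeven_hours_per_day(premium)
    print(f"{premium:>3}x premium: rent if you use under "
          f"{hours:.1f} h/day ({hours * 60:.0f} min)")
```

Under this model the thresholds are about 14 minutes a day at 100x, 2.4 hours at 10x, and 4.8 hours at 5x, which is in the same ballpark as the figures above.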
Actually, they wouldn't spend the money if it were cheaper.
HBM has way higher bandwidth, and it's not all about flops.
Also, the FP4 flops (inference) are so mind-bogglingly high on these things.
Lastly, what you fail to consider is the chip-to-chip bandwidth, which is critical.
The people running these know that networking is just as critical.
All-reduce, etc.
They wouldn't pay if they could get something better value.
Just like cloud is "cheaper" than colo/metal, right?
> cutting edge will always be cloud
Don't think anyone was refuting that?
And of course when you pool resources you have access to more resources.
They just mean this part: "where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription."
Upgrading local hardware will remain the more expensive alternative to the subscription regardless of what the relative cost of running the models themselves is. If the local hardware to do so becomes affordable, then the subscription will be even more affordable, not expensive.
At least for these kinds of mega tasks. For more micro tasks, we will always end up with unutilized local compute we already purchased, which will be "free" since we already paid for it for non-AI reasons (e.g. a gaming GPU while not gaming).
> so forget it.
Which explains why you're using a dumb terminal to access compute services?
Basically, yes. We are on a website, after all.
Where I think you're wrong is that everything in technology has been cyclical, it's just a matter of time.