← Back to context

Comment by chadash

3 months ago

> Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility to what the true cost of tokens is now, let alone what it will be in a few years time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

We do have some idea. Kimi K2 is a relatively high performing open source model. People have it running at 24 tokens/second on a pair of Mac Studios, which costs 20k. This setup requires less than a KW of power, so the $0.8-0.15 being spent there is negligible compared to a developer. This might be the cheapest setup to run locally, but it's almost certain that the cost per token is far cheaper with specialized hardware at scale.

In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized but I think this might be overblown [1].

[1] training models is still extraordinarily expensive and that is certainly being subsidized, but you can amortize that cost over a lot of inference, especially once we reach a plateau for ideas and stop running training runs as frequently.

> a near-frontier model

Is Kimi K2 near-frontier though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say, they always say it's great and close to frontier models, but is this other's impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing that for the type of engineering work I do, which is fairly typical stuff.

  • I’ve been running K2.5 (through the API) as my daily driver for coding through Kimi Code CLI and it’s been pretty much flawless. It’s also notably cheaper and I like the option that if my vibe coded side projects became more than side projects I could run everything in house.

    I’ve been pretty active in the open model space and 2 years ago you would have had to pay 20k to run models that were nowhere near as powerful. It wouldn’t surprise me if in two more years we continue to see more powerful open models on even cheaper hardware.

    • I agree with this statement. Kimi K2.5 is at least as good as the best closed source models today for my purposes. I've switched from Claude Code w/ Opus 4.5 to OpenCode w/ Kimi K2.5 provided by Fireworks AI. I never run into time-based limits, whereas before I was running into daily/hourly/weekly/monthly limits all the time. And I'm paying a fraction of what Anthropic was charging (from well over $100 per month to less than $50 per month).

      5 replies →

    • Depends what you see as flawless. From my perspective even GPT 5.2 produces mostly garbage grade code (yes it often works, but it is not suitable for anywhere near production) and takes several iterations to get it to remotely workable state.

      7 replies →

  • regardless its been 3 years since the release of chatgpt. literally 3. imagine in just 5 more years how much low hanging (or even big breakthroughs) will get into the pricing, things like quantization, etc. no doubt in my mind the question of "price per token" will head towards 0

You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8 bit quant, or 17 tokens per second on Minimax M2.5 at a 3 bit quant.

Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4. 6-12 months behind SOTA on something that's well within a personal hobby budget.

  • I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window.

    I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?

    I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop.

    • I've literally just gotten Minimax M2.5 set up, the only test I've done is the "car wash" test that has been popular recently: https://mastodon.world/@knowmadd/116072773118828295

      Minimax passed this test, which even some SOTA models don't pass. But I haven't tried any agentic coding yet.

      I wasn't able to allocate the full context length for Minimax with my current setup, I'm going to try quantizing the KV cache to see if I can fit the full context length into the RAM I've allocated to the GPU. Even at a 3 bit quant MiniMax is pretty heavy. Need to find a big enough context window, otherwise it'll be less useful for agentic coding. With Qwen3 Coder Next, I can use the full context window.

      Yeah, I've also seen the occasional tool call looping in Qwen3 Coder Next, that seems to be an easy failure mode for that model to hit.

      3 replies →

  • It is crazy to me that it is that slow, 4 bit quants don't lose much with Qwen3 coder next and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps with a 3090 (24gb) as a VM with 256k context size with llama.cpp

    Same with unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss20b gets 195 tps!!!

    The advantage is that you can use the APU for booting, and pass through the GPU to a VM, and have nice safer VMs for agents at the same time while using DDR4 IMHO.

    • Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to try tweaking performance, this is just the first setup I've gotten that works.

      2 replies →

  • If you don't mind saying, what distro and/or Docker container are you using to bet Qwen3 Coder Next going?

    • I'm running Fedora Silverblue as my host OS, this is the kernel:

        $ uname -a
        Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb  6 21:43:09 UTC 2026 x86_64 GNU/Linux
      

      You also need to set a few kernel command line paramters to set it up to allow it to use most of your memory as graphics memory, I have the following in my kernel command line, those are each 110 GiB expressed in number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea):

        ttm.pages_limit=28835840 ttm.page_pool_size=28835840
      

      Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm, the llama.cpp container has ROCm 7.0 but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet, I'm pretty much at the point where I've just gotten everything working.

      In a checkout of the llama.cpp repo:

        podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .
      

      Then I run the container with something like:

        podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
      

      Still getting my setup dialed in, but this is working for now.

      Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:

        podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
          --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
          --jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
      

      (as mentioned, still just getting this set up so I've been moving around between using `-hf` to pull directly from HuggingFace vs. using `uvx hf download` in advance, sorry that these commands are a bit messy, the problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up)

20k for such a setup for a hobbyist? You can leave the somewhat away and go into sub 1% region globally. A kw of power is still 2k/year at least for me, not that I expect it will run continuously but still not negligible if you can do with 100-200 a year on cheap subscriptions.

  • There are plenty of normal people with hobbies that cost much more. Off the top of my head, recreational vehicles like racecars and motorcycles, but im sure there are others.

    You might be correct when you say the global 1%, but that's still 83 million people.

    • I used to think photography was an expensive hobby until my wife got back into the horse world.

Reminder to others that $20k is the one time startup cost, and is amortized perhaps 2-4k/year (plus power). That is in the realm of a mere gym membership around me for a family

  • So 5-10 years to amortize the cost. You could get 10 years of Claude Max and your $20k could stay in the bank in case the robots steal your job or you need to take an ambulance ride in the US.

> And it's hard to imagine that the hardware costs don't come down quite a bit.

have you paid any attention to the hardware situation over the last year?

this week they've bought up the 2026 supply of disks

> a cost that a (somewhat wealthy) hobbyist can afford

$20,000 is a lot to drop on a hobby. We're probably talking less than 10%, maybe less than 5% of all hobbyists could afford that.

  • You can rent computer from someone else to majorly reduce the spend. If you just pay for tokens it will be cheaper than buying the entire computer outright.

  • Up front, yeah. But people with hobbies on the more expensive end can definitely put out 4k a year. Im thinking like people who have a workshop and like to buy new tools and start projects.

90% of companies would go bankrupt in a year if you replaced their engineering team with execs talking to k2...

  • Most execs I've worked with couldn't tell their engineering team what they wanted with any specificity. That won't magically get any better when they talk to an LLM.

    If you can't write requirements an engineering team can use, you won't be able to write requirements for the robots either.

Horrific comparison point. LLM inference is way more expensive locally for single users than running batch inference at scale in a datacenter on actual GPUs/TPUs.

  • How is that horrific? It sets an upper bound on the cost, which turns out to be not very high.

If I remember correctly Dario had claimed that AI inference gross profit margins are 40%-50%

  • Why do you people trust what he has to say? Like omg dude. These folks play with numbers all the time to suit their narrative. They are not independently audited. What do you think scares them about going public? Things like this. They cannot massage the numbers the same way they do in the private market.

    The naivete on here is crazy tbh.

    • Pretty poor narrative tbh. As things stand they will not be profitable unless stop developing new models or get to AGI. So very likely never.