Comment by pants2

11 hours ago

This really highlights the impracticality of local models:

My $3k Macbook can run `GPT-OSS 20B` at ~16 tok/s according to this guide.

Or I can run `GPT-OSS 120B` (a model 6X larger) at 360 tok/s (~22X faster) on Groq at $0.60 per million output tokens.

To generate $3k worth of output tokens at that pricing, my local Mac would have to run continuously for about 10 years.
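A quick back-of-envelope check of that break-even claim, assuming only the figures quoted in this comment ($0.60 per million output tokens, 16 tok/s locally, $3,000 hardware cost):

```python
# Break-even arithmetic using the numbers from the comment above.
price_per_mtok = 0.60    # Groq output price, $ per million tokens
hw_cost = 3000           # MacBook price, $
local_rate = 16          # local generation speed, tokens/second

# Tokens you'd have to generate locally to "earn back" the hardware.
tokens_to_break_even = hw_cost / price_per_mtok * 1e6   # 5 billion tokens

# How long that takes at 16 tok/s, running nonstop.
seconds = tokens_to_break_even / local_rate
years = seconds / (3600 * 24 * 365)
print(f"{tokens_to_break_even:.0f} tokens, {years:.1f} years")  # ~9.9 years
```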

There's virtually no economic break-even to running local models, and no advantage in intelligence or speed. The only thing you really get is privacy and offline access.

A million tokens is like 5 minutes of inference for heavy coding use.

  • At work I regularly hit the 7.5M-tokens-per-hour limit one of our tools has, and have to switch model or tool, and I'm not even a remotely heavy user. I think people don't realise how many tokens get burned on CoT and tool calls these days.

    At a 7.5M-tokens-per-hour hard limit, that's ~84 days of 8-hour use (about 28 days running continuously) to hit the grandparent's $3k.

    That said, local models really are still slow, or fast enough but not that great.

    • They already stated they can only generate 57,600 tokens per hour locally (expressed as 16 tokens per second). So that's the limiting factor here.
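A quick sanity check of the throughput figures in this subthread, assuming the rates quoted above (16 tok/s locally, a 7.5M tokens/hour API limit, and $3k of tokens at $0.60/Mtok):

```python
# Throughput comparison using the figures quoted in the thread.
local_tokens_per_hour = 16 * 3600            # 57,600 tokens/hour locally
break_even_tokens = 3000 / 0.60 * 1e6        # 5 billion tokens = $3k at $0.60/Mtok

# Time to generate that many tokens at the 7.5M/hour API limit.
hours_at_limit = break_even_tokens / 7.5e6   # ~667 hours
days_continuous = hours_at_limit / 24        # ~28 days running nonstop
days_at_8h = hours_at_limit / 8              # ~83 days at 8 hours/day
print(local_tokens_per_hour, round(days_continuous), round(days_at_8h))
```

The ~83-day figure at 8 hours a day roughly matches the "84 days" mentioned upthread.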

You're saying that as if privacy were worthless? Also, not many people would count the full price of a MacBook strictly towards running a local model.

Instead, if you wanted a MacBook anyway, you get to run local models for free on top. A very different story.

  • The privacy angle is not that interesting to me.

    - You can find inference providers with whatever privacy terms you're looking for

    - If you're using LLMs with real data (say, handling Gmail), then Google has your data anyway, so you might as well use the Gemini API

    - Even if you're a hardcore roll-your-own-mail-server type, you probably still use a hosted search engine and have gotten comfortable with their privacy terms

    Also, on cost: the point is you can use an API that's many times smarter and faster for a rounding-error cost compared to your Mac. So why bother with local except for the cool factor?