Comment by 650REDHAIR

5 days ago

I’ll keep my data local over a $.02/mtok difference.

9 comments

650REDHAIR

It’s more than just data locality. OpenRouter is faster, no? I have an M4 Pro, and anything but the smallest, dumbest models is unusably slow for interactive use. I personally haven’t yet found a good use case for offline/non-interactive LLM work locally.

datadrivenangel 5 days ago
Yeah. The speed is the biggest issue. The intelligence of open models is good enough for serious work (though still worse than the frontier models), but the cloud models are often 3-7 times faster, and you can get more parallelization and so get speeds on the order of hundreds of tokens per second, which makes things fast!
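To put rough numbers on that speed gap, here is a back-of-the-envelope sketch. The local decode speed (15 tokens/s), the response length (3000 tokens), and the four parallel streams are hypothetical values picked for illustration. Only the 3-7x range comes from the comment above. It also assumes the work splits cleanly into independent requests, which is not always true.

```python
def seconds_to_generate(tokens, tokens_per_second, parallel_streams=1):
    """Wall-clock seconds to produce `tokens`, split evenly across streams."""
    return tokens / (tokens_per_second * parallel_streams)


LOCAL_TPS = 15          # hypothetical local decode speed, tokens per second
RESPONSE_TOKENS = 3000  # hypothetical amount of output needed

print(f"local:           {seconds_to_generate(RESPONSE_TOKENS, LOCAL_TPS):.0f} s")

for speedup in (3, 7):  # the "3-7 times faster" range from the comment
    cloud_tps = LOCAL_TPS * speedup
    single = seconds_to_generate(RESPONSE_TOKENS, cloud_tps)
    # Four parallel requests is an assumption, not a measured figure.
    fanned = seconds_to_generate(RESPONSE_TOKENS, cloud_tps, parallel_streams=4)
    print(f"cloud {speedup}x:        {single:.0f} s single stream, "
          f"{fanned:.0f} s across 4 streams "
          f"({cloud_tps * 4} tokens/s aggregate)")
```

Under these assumptions the aggregate rate lands at 180 to 420 tokens/s, which is in line with the "hundreds of tokens per second" figure.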
- freeopinion 5 days ago
  
  Even extremely slow LLMs can generate Part B faster than I can audit Part A. So the LLM can generate Part A while I look over my email. Then it can worry over Part B while I look over Part A.
  It can worry over Part C while I have my 10:30 group meet. And it can worry over Part D while I do whatever other silly, time-wasting thing all humans do in almost all organizations. I still haven't reviewed Part B yet, so the extremely slow AI is waiting on me.
  Maybe someday I'll be good enough to need faster AI so I can rewrite something like Bun in a few days. Right now, slow and local fits my use case very well.
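The overlap described above (the model writes the next part while the human reviews the current one) can be sketched as a tiny timing model. The per-part durations below are hypothetical, and the model is assumed to generate parts back to back without waiting for feedback.

```python
def pipelined_finish_time(gen_minutes, review_minutes, parts):
    """Minutes until the last part is reviewed when generation of the next
    part overlaps with human review of the current one."""
    gen_done = 0.0     # when the model finishes the current part
    review_done = 0.0  # when the human finishes reviewing the current part
    for _ in range(parts):
        gen_done += gen_minutes  # model works back to back, never waits
        # Review starts once the part exists and the reviewer is free.
        review_done = max(gen_done, review_done) + review_minutes
    return review_done


def sequential_finish_time(gen_minutes, review_minutes, parts):
    """Minutes if the human sits idle during every generation step."""
    return parts * (gen_minutes + review_minutes)


# Hypothetical numbers: 4 parts, 30 minutes of human review per part.
slow_local = pipelined_finish_time(gen_minutes=20, review_minutes=30, parts=4)
fast_cloud = pipelined_finish_time(gen_minutes=4, review_minutes=30, parts=4)
no_overlap = sequential_finish_time(gen_minutes=20, review_minutes=30, parts=4)

print(slow_local, fast_cloud, no_overlap)
```

With these made-up numbers the slow model finishes in 140 minutes and a model five times faster finishes in 124, because review time dominates. Without any overlap the slow model would take 200 minutes.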
  
threatofrain 5 days ago

And continuing the argument of "more than just...", if you stop running inference on your Mac, you still have a generally nice computer. It's the difference between renting and buying.
novok 5 days ago

I played with classifying and summarizing my entire email history (per email) with small models, but that only took about 12h of GPU time at most. Using a coding-agent CLI wrapper in that case is far slower because of all the spin-up cost and the system prompt they inject, even if you want to turn it all off.
If I used an actual direct API it probably would've been much faster, but I'm doing it for hobby / fun reasons. You also get to fiddle with a lot more params.
PAndreew 5 days ago
I’m running a local Whisper + Gemma 4 pipeline with a cheap USB mic to extract health-related data and potential todos from ambient speech. It doesn’t have to be fast or 100% correct: if it captures at least a few bits of interesting information that would otherwise go unnoticed, it’s still a win.
- 650REDHAIR 5 days ago
  
  I run Whisper through Open WebUI to Gemma 4 MoE and use Kokoro TTS to speak the response back to me.
  I use a 5060 Ti 16GB and a mini PC.
  I tunnel in via Tailscale and access it with my phone or laptop from anywhere. It’s pretty good and will only get better as I optimize.