Comment by bbshfishe

4 hours ago

Yeah, no, it didn’t. If you have a fully specced-out M3/M4 MacBook with enough memory, you’re already running pretty decent models locally. But no one is using local models anyway.

I run a local model daily. I have it creating tickets when certain emails come in, and I made a small button I can click to approve ticket creation. It follows my instructions and has a nicely trained chain-of-thought process. Local LLMs are starting to become very useful. Not OpenClaw crap.
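A minimal sketch of that email-to-ticket flow. The field names, the keyword pre-filter, and the approval flag are all invented for illustration; the commenter's actual setup is not described in this detail. The local model would do the summarizing/classifying where the comment indicates.

```python
import re

def should_ticket(subject):
    """Cheap pre-filter before handing the email to the local model.
    The keyword list is illustrative, not the commenter's actual rules."""
    return bool(re.search(r"\b(error|bug|outage|support)\b", subject, re.I))

def draft_ticket(subject, body):
    """Build a ticket draft that waits for a one-click approval.
    In the real flow, a locally hosted LLM would summarize/classify
    `body` here before the draft is shown for approval."""
    return {
        "title": subject.strip(),
        "body": body.strip(),
        "status": "pending_approval",  # flips to "open" on the approve click
    }

if should_ticket("Bug: export fails on save"):
    ticket = draft_ticket("Bug: export fails on save",
                          "The export dialog crashes when saving as CSV.")
    print(ticket["status"])  # pending_approval
```

The approval step is the interesting design choice: the model only drafts, and nothing is created until a human clicks.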

  • What VRAM are you running that allows both a capable model and everything else the device needs to run?

With OpenClaw and powerful local models like Kimi 2.5, these specs make a lot of sense.

  • K2.5 isn't remotely a local model

    • Oh sure it is! I’ve helped set up an AI cluster rack with four K2.5s.

      With some custom tooling, we built our own local enterprise setup:

      • Support ticketing system
      • Custom chat support powered by our trained software-support model
      • Resolved repository with detailed step-by-step instructions
      • User-created reports and queries
      • Natural language-driven report generation (my favorite: no more dragging filters into the builder; our (secret) local model handles it for clients)
      • In-application tools (C#/SQL/ASP.NET) to support users directly, since our software runs on-site and offline due to PII
      • A cool repair tool: an import/export “support file packet patcher” that lets us push fixes live to all clients or target niche cases

      Qwen3 with LoRA fine-tuning is also incredible; we’re already seeing great results training our own models.

      There’s a growing group pushing to run K2.5 on consumer PCs (32GB RAM plus at least 9GB of VRAM), and it’s looking very promising. If it works, we’ll be retooling everything: our apps and our in-house programs. Exciting times ahead!

    • Technically you can get most MoE models to execute locally, because RAM requirements are limited to the active experts' activations (on the order of the active parameter count). Everything else can either be mmap'd in (the read-only parameters) or cheaply swapped out (the KV cache, which grows linearly per generated token and is usually small). But that gives you absolutely terrible performance, because almost everything is bottlenecked by storage transfer bandwidth. So good performance is really a question of how much more you have than that bare minimum.
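A back-of-the-envelope version of the bare-minimum calculation described above. The numbers plugged in are hypothetical, not K2.5's actual specs:

```python
def min_resident_gb(active_params_b, bytes_per_param=2,
                    kv_bytes_per_token=0, tokens_generated=0):
    """Rough floor on resident memory for a MoE model whose read-only
    weights are mmap'd from disk: only the active experts' working set
    plus the KV cache (linear in generated tokens) must stay resident."""
    active_bytes = active_params_b * 1e9 * bytes_per_param
    kv_bytes = kv_bytes_per_token * tokens_generated
    return (active_bytes + kv_bytes) / 1e9

# Hypothetical MoE with 30B active params at fp16 (2 bytes/param):
print(min_resident_gb(30))  # 60.0 GB for the active set alone
# ...plus a hypothetical 1 MB/token KV cache after 1000 generated tokens:
print(min_resident_gb(30, kv_bytes_per_token=1e6, tokens_generated=1000))  # 61.0
```

Everything above that floor is what actually buys throughput, since otherwise the mmap'd experts get streamed back in from storage as the pages are evicted.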

    • Of course it's not remotely local: "remote" and "local" are literally antonyms.