Comment by mips_avatar

3 months ago

Have you tried any really big models on a Mac Studio? I'm wondering what latency is like for big Qwens if there's enough memory.

Not yet with MetalRT. Right now we support models up to ~4B parameters (Qwen3 4B, Llama 3.2 3B, LFM2.5 1.2B). These are optimized for the voice pipeline use case, where decode speed and latency matter more than model size.

Expanding to larger models (7B, 14B, 32B) on machines with more unified memory is on the roadmap. The Mac Studio with 192GB would be an interesting target: a 32B model at 4-bit would fit comfortably, and MetalRT's architectural advantages (fused kernels, minimal dispatch overhead) should scale well.
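
For a rough sense of the "fits comfortably" claim, here is a back-of-envelope sketch of weight memory at 4-bit. The helper name `weight_footprint_gb` is made up for illustration and is not part of MetalRT. It counts the quantized weights only and ignores quantization scales, KV cache, activations, and runtime overhead, so real usage will be somewhat higher.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in decimal GB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Real 4-bit formats also store per-group scale data, which adds a bit more.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Model sizes mentioned above, all at 4 bits per weight.
for size in (7, 14, 32):
    print(f"{size}B @ 4-bit: ~{weight_footprint_gb(size, 4):.1f} GB of weights")
```

By this estimate a 32B model needs about 16 GB for weights, which leaves most of a 192GB machine free for context and everything else.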

What model / use case are you thinking about? That helps us prioritize.

  • Well, it’s more that in the agents I’ve built, Qwen doesn’t get reliable until around 27B, so unless you want to RL a small Qwen, I don’t think I’d get much useful help out of it.

    • That tracks with what we've seen too. For agent workflows with reliable tool calling, you really do need the larger models. Larger model support is a priority for us. Thanks for the data point.

I am running the 80B Qwen Coder Next model (4-bit quant, MLX version) on a 96GB M3 MacBook and it responds quickly, almost immediately. I can fit the model plus 128k context comfortably in memory.