Comment by sanchitmonga22
3 months ago
Not yet with MetalRT, right now we support models up to ~4B parameters (Qwen3 4B, Llama 3.2 3B, LFM2.5 1.2B). These are optimized for the voice pipeline use case where decode speed and latency matter more then model size.
Expanding to larger models (7B, 14B, 32B) on machines with more unified memory is on the roadmap. The Mac Studio with 192GB would be an interesting target, a 32B model at 4-bit would fit comfortably and MetalRT's architectural advantages (fused kernels, minimal dispatch overhead) should scale well.
What model / use case are you thinking about? That helps us prioritize.
Well it’s just more that I’ve noticed in the agents I’ve built that qwen doesn’t get reliable until around 27b so unless you want to rl small qwen I don’t think I would get much useful help out of it.
That tracks with what we've seen too. For agent workflows with reliable tool calling, you really do need the larger models. Larger model support is a priority for us. Thanks for the data point.