Comment by egorfine
1 day ago
Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).
This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.
Unsloth Studio [0] is what I recommend these days, open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support not merged in you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS' Metal GPU acceleration but that's not integrated into Unsloth Studio just yet.
[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide
I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.
Curious if you tested llama.cpp and still found oMLX faster? I haven't tried the latter myself, might give it a go.
1 reply →
I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.
Yeah it still has some issues on the UX side. It works fine resuming though, just select the same model again and it'll resume the download, the only issue is there isn't a dedicated download page as that would help a lot.
What's better about Unsloth Studio vs LM Studio is it tells you exactly what quantization to use especially as Unsloth ones are quite good, and that it has web search and self-healing tool calls so having a web-searching local ChatGPT alternative is very easy to spin up.
Thanks! Videos is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.
You know what I REALLY want? Just point this beast at the folders and it tell me which 150 shots are good to process from these 1,500 images. That's the dream!
Although the technology is getting there, it's still a very difficult problem to solve. Taste and art is subjective. Also me as a photographer will always be concerned - "what if my best shot was in one of these rejected shots".
But yeah, I think I'll try to do some more of these experiments soon.
there’s a lot of open models out there… I told Claude to do a weighted score on several models and deduplicate by CLIP similarity for an expedition, should be easy to replicate (see below). Sure doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at me not having to wade through the bottom 80% of mediocre shots and dupes!
—-
“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.
Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
Very interesting! Wasn't aware of these. I'll be exploring them soon. Thanks
I have been contemplating a M5 Pro MBP, but for the life for me I wasn't able to find benchmarks for real-world models, do you happen to know how many tokens per second roughly you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?
I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE
You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.
Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.
Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.
Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.
Great, thanks! :-) and to mirror another poster: what kind of prompt parsing (prefill) speed do you get for that model? Also how is the speed for the 27B model?
2 replies →
Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp.
Native MCP:
For Qwen 35B enabling native MCP on MLX models slows it down by 10%.
For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.
(all tested on M5 pro).
[dead]
I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server
Prefill speed and 27B number?