Comment by egorfine

1 day ago

Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).

This is an excellent thing to do. Especially that LLMs excel at batching thus you can index multiple photos and videos in parallel for no performance penalty.

21 comments

egorfine

satvikpendem 1 day ago

Unsloth Studio [0] is what I recommend these days, open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support not merged in you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS' Metal GPU acceleration but that's not integrated into Unsloth Studio just yet.

[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide

egorfine 1 day ago
I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.
- regexorcist 1 day ago
  
  Curious if you tested llama.cpp and still found oMLX faster? I haven't tried the latter myself, might give it a go.
  
  1 reply →
mft_ 1 day ago
I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.
- satvikpendem 18 hours ago
  
  Yeah it still has some issues on the UX side. It works fine resuming though, just select the same model again and it'll resume the download, the only issue is there isn't a dedicated download page as that would help a lot.
  What's better about Unsloth Studio vs LM Studio is it tells you exactly what quantization to use especially as Unsloth ones are quite good, and that it has web search and self-healing tool calls so having a web-searching local ChatGPT alternative is very easy to spin up.

asenna 1 day ago

Thanks! Videos is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.

You know what I REALLY want? Just point this beast at the folders and it tell me which 150 shots are good to process from these 1,500 images. That's the dream!

Although the technology is getting there, it's still a very difficult problem to solve. Taste and art is subjective. Also me as a photographer will always be concerned - "what if my best shot was in one of these rejected shots".

But yeah, I think I'll try to do some more of these experiments soon.

endymi0n 1 day ago
there’s a lot of open models out there… I told Claude to do a weighted score on several models and deduplicate by CLIP similarity for an expedition, should be easy to replicate (see below). Sure doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at me not having to wade through the bottom 80% of mediocre shots and dupes!
—-
“Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.
Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
- asenna 20 hours ago
  
  Very interesting! Wasn't aware of these. I'll be exploring them soon. Thanks

busfahrer 1 day ago

I have been contemplating a M5 Pro MBP, but for the life for me I wasn't able to find benchmarks for real-world models, do you happen to know how many tokens per second roughly you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?

ahknight 1 day ago

I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE
embedding-shape 1 day ago

You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.
egorfine 1 day ago
Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.
Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.
Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.
- busfahrer 1 day ago
  
  Great, thanks! :-) and to mirror another poster: what kind of prompt parsing (prefill) speed do you get for that model? Also how is the speed for the 27B model?
  
  2 replies →
egorfine 1 day ago

Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp.
egorfine 1 day ago
Native MCP:
For Qwen 35B enabling native MCP on MLX models slows it down by 10%.
For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.
(all tested on M5 pro).
- mlvljr 1 day ago
  
  [dead]
juancn 1 day ago
I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server
- brcmthrowaway 1 day ago
  
  Prefill speed and 27B number?