Comment by originalvichy

6 hours ago

Yup! Smaller quants will fit within 24GB but they might sacrifice context length.

I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.

2 comments

originalvichy

donmcronald 5 hours ago

I have an Mini M4 Pro with 64GB of 273GB/s memory bandwidth and it's borderline with 3.5-27B. I assume this one is the same. I don't know a ton, but I think it's the memory bandwidth that limits it. It's similar on a DGX Spark I have access to (almost the same memory bandwidth).

It's been a while since I tried it, but I think I was getting around 12-15 tokens per second an that feels slow when you're used to the big commercial models. Whenever I actually want to do stuff with the open source models, I always find myself falling back to OpenRouter.

I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple days ago and that felt usable speed wise. I don't know about quality, but that's like running a 3B parameter model. 27B is a lot slower.

I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable?

ycui1986 6 hours ago

32GB RAM on mac also need to host OS, software, and other stuff. There may not even be 24GB VRAM left for the model.