Comment by terhechte

14 days ago

The (smaller) Scout model is really attractive for Apple Silicon. It's 109B parameters, but split across 16 experts, which means that the actual processing happens in 17B. So responses should be about as fast as with current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct) a question with a 2k context and got ~60 tokens/sec, which is really fast (MacBook Pro M4 Max). So this could hit ~30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.
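A back-of-envelope sketch of that scaling, assuming token generation is memory-bandwidth bound and that both runs use a similar quantization (a rough estimate, not a benchmark):

```python
# Scale the observed 7B throughput to a 17B-active model, assuming decode
# speed is memory-bandwidth bound and the quantization is comparable.
observed_7b_tok_s = 60          # the qwen 2.5 7B run mentioned above
active_7b, active_17b = 7, 17   # active parameters per token, in billions

est_scout_tok_s = observed_7b_tok_s * active_7b / active_17b
print(f"estimated Scout decode speed: ~{est_scout_tok_s:.0f} tok/s")  # ~25 tok/s
```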

In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.

> the actual processing happens in 17B

This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.

In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of them.
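A toy illustration of why (not Llama 4's actual router; the 16-expert, one-routed-expert-per-token setup is just an assumed Scout-like shape): even with a single expert active per token, different tokens hit different experts, so over a modest sequence nearly every expert's weights get streamed through memory.

```python
# Toy illustration: route each token to one of 16 experts at random and
# count how many distinct experts get touched over a short sequence.
# Real routers are learned, not random, but the coverage effect is similar.
import random

NUM_EXPERTS, SEQ_LEN = 16, 256
touched = set()
for _ in range(SEQ_LEN):
    touched.add(random.randrange(NUM_EXPERTS))

print(f"experts touched: {len(touched)}/{NUM_EXPERTS}")  # almost always 16/16
```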

  • I agree the OP’s description is wrong. That said, I think his conclusions are right, in that a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, esp. on Macs as they are heavily throughput bound.

  • For all intents and purposes, the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token: 17B active parameters run ~6x faster than 109B simply because less data needs to be loaded from RAM per token.

    • Yes, loaded from RAM versus loaded into RAM is the big distinction here.

      It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.

      4 replies →

To add, they say about the 400B "Maverick" model:

> while achieving comparable results to the new DeepSeek v3 on reasoning and coding

If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.

The upside being that it can be used on codebases without having to share any code with an LLM provider.

  • Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute of prompt processing per 10K tokens(!) with the smaller model. I agree, and I contribute to llama.cpp. If anything, that is quite generous.

    [^1] https://news.ycombinator.com/item?id=43595888

    • I don't think the time grows linearly. The more context, the slower it gets (at least in my experience, because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago, and processing took 12 sec to first token (qwen 2.5 32b q8).

      2 replies →

To clarify, you're still going to want enough RAM for the entire model plus context. Scout at 109B params roughly fills a 64GB machine at q4, leaving your context and other applications only about 9GB to work with (rough arithmetic below).

109B at Q6 is also a nice fit for the 128GB Framework Desktop.
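A rough weights-only sketch of those footprints; the bits-per-weight figures are ballpark assumptions (effective bits/weight varies by quant family), and KV cache plus runtime overhead come on top:

```python
# Ballpark weights-only memory footprint for a 109B-parameter model.
# Effective bits/weight differs between quant families; KV cache and
# runtime overhead are not included.
PARAMS = 109e9

def weights_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bpw in [("~q4", 4.0), ("~q6", 6.5), ("~q8", 8.5)]:
    print(f"{label} ({bpw} bits/weight): ~{weights_gb(bpw):.0f} GB")
# ~q4: ~55 GB (tight on a 64 GB machine), ~q6: ~89 GB, ~q8: ~116 GB
```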

Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?

  • This is a common misunderstanding. Experts aren't split along human topics; they're learned via gating networks during training, which route dynamically per token at each MoE layer. As a slightly lossy example, one layer might end up with an expert that fires on the word "apple".

    At inference time, queries are then also routed dynamically (a toy sketch of this kind of router follows the replies below).

  • "That’s dynamically decided during training and not set before, right?"

    ^ right. I can't recall it off the top of my head, but there was a recent paper showing that if you tried dictating this sort of split, performance fell off a cliff (I presume there's some base layer of knowledge that each expert needs).

  • It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here), although some experts may end up heavily correlating with certain domains.
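A toy sketch of the kind of learned top-k routing described in the replies above; this is a generic MoE layer with made-up sizes for illustration, not Llama 4's actual implementation:

```python
# Minimal top-k MoE layer: a linear "router" scores every expert per token,
# the top-k experts run, and their outputs are mixed by the softmaxed scores.
# Nothing ties an expert to a human topic; the mapping is whatever the
# gradient finds during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny usage example: push 8 "tokens" through the layer.
moe = TopKMoE()
print(moe(torch.randn(8, 512)).shape)            # torch.Size([8, 512])
```

The router is trained jointly with the experts, so any specialization that emerges is a by-product of the gradient, not a predefined topic assignment.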

Looks like 109B would fit in a 64GiB machine's RAM at 4-bit quantization. Looking forward to trying this.

  • I read somewhere that the Ryzen AI 370 chip can run Gemma 3 14B at ~7 tokens/second, so I would expect Llama 4 Scout with 17B active to land somewhere in that range.

At 109b params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.

  • Sure, but the upside of Apple Silicon is that large memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090s or 4090s). Also, you can download quantizations.

    • I have Apple Silicon, and it's the worst when it comes to prompt processing time. So unless you stick to small contexts, it's not fast enough to do any real work with.

      Apple should've invested more in bandwidth, but it's Apple and has lost its visionary. Imagine having 512GB on an M3 Ultra and not being able to usefully run even a 70B model at a decent context window.

      4 replies →

    • At a 4-bit quant (which needs ~64GB), the price of the Mac (~$4.2K) is almost exactly the same as 2x 5090s (provided we ever see them in stock). But 2x 5090s have about 6x the memory bandwidth and probably close to 50x the matmul compute at int4.

      3 replies →

    • Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)

      8 replies →

Unless I'm missing something, I don't really think it looks that attractive. They're comparing it to Mistral Small 24B and Gemma 3 27B and posting numbers showing it's a little better than those models. But at 4x the memory footprint, is it worth it? (Personally, I was hoping to see Meta's version of a 24-32B dense model, since that size is clearly very capable, or something like an updated Mixtral 8x7B.)

Won’t prompt processing need the full model though, and be quite slow on a Mac?

  • Yes, that's what I tried to express. Large prompts will probably be slow; I tried a 120k prompt once and it took 10 min to process (rough numbers below). But you still get a ton of world knowledge and fast response times, and smaller prompts process quickly.
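A very rough prefill-time sketch based on that anecdote (120k tokens in ~10 minutes, from the qwen 2.5 32B q8 run mentioned earlier in the thread); Scout's 17B active parameters and a lighter quant may do better, so treat this as an order-of-magnitude guide rather than a Scout benchmark:

```python
# Estimate time-to-first-token from the observed prompt-processing rate
# of the 120k-token / 10-minute run quoted above (~200 tok/s prefill).
OBSERVED_PREFILL_TOK_S = 120_000 / 600

for prompt_tokens in (2_000, 32_000, 120_000):
    minutes = prompt_tokens / OBSERVED_PREFILL_TOK_S / 60
    print(f"{prompt_tokens:>7} prompt tokens: ~{minutes:.1f} min to first token")
```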