I parsed "reasonable" as having reasonable speed to actually use this as intended (in agentic setups). In that case, it's a minimum of $70-100k for hardware (8x 6000 PRO + all the other pieces to make it work). The model comes with a native INT4 quant, so ~600GB for the weights alone. An 8x 96GB setup would give you ~160GB for the KV cache.
You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes per turn in high-context sessions).
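For anyone who wants to check the memory math above, here is a rough sketch. It only uses the figures from the comment (8 GPUs, 96GB each, ~600GB of weights); the function name is made up for illustration, and it ignores activations, framework overhead and fragmentation, which is probably why the usable number gets rounded down to ~160GB.

```python
def kv_headroom_gb(num_gpus: int, vram_per_gpu_gb: float, weights_gb: float) -> float:
    """VRAM left after loading the weights, in GB (hypothetical helper).

    Ignores activations, runtime overhead and fragmentation, so the
    real KV cache budget will be somewhat lower than this.
    """
    total_gb = num_gpus * vram_per_gpu_gb
    return total_gb - weights_gb


if __name__ == "__main__":
    # Figures taken from the comment: 8x 96GB cards, ~600GB INT4 weights.
    print("total VRAM (GB):", 8 * 96)
    print("left for KV cache (GB):", kv_headroom_gb(8, 96, 600))
```

So the raw headroom is a bit under 170GB before any overhead is subtracted.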
Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each, so $20,000 for the pair.
You might want to clarify that this is more of a "look, it technically works" than an "I actually use this". The difference between waiting 20 minutes for an answer to the prompt '1+1=' and actually using it for something useful is massive here. I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Was it Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although I did do this for a 7B model.)
MLX uses the GPU.
That said, I wouldn't necessarily recommend spending $20,000 on a pair of Mac Studios to run models like this. The performance won't be nearly as good as the server-class GPU hardware that hosted models run on.
The reason Macs get recommended is the unified memory, which is usable as VRAM for the GPU. People are similarly using the AMD Strix Halo for AI, which has a similar memory architecture. Time to first token for something like '1+1=' would be seconds, and then you'd be getting ~20 tokens per second, which is plenty fast for regular use. Tokens/s drops at the higher end of the context window, but it's still practical for a lot of use cases. Though I agree that agentic coding, especially over large projects, would likely get too slow to be practical.
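To put the ~20 tokens per second figure in perspective, here is a back-of-the-envelope sketch. The 20 tok/s rate is from the comment above; the 3 second time to first token and the output lengths are assumptions picked only for illustration, and the function name is hypothetical. It also assumes a constant decode rate, which won't hold at long context.

```python
def response_seconds(output_tokens: int, tokens_per_second: float, ttft_seconds: float) -> float:
    """Rough wall-clock time for one reply: time to first token plus decode time.

    Assumes a constant decode rate, which is optimistic at long context.
    """
    return ttft_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    # 20 tok/s is from the thread; 3s TTFT and the token counts are assumed.
    for n in (5, 500, 5000):
        print(n, "tokens ->", response_seconds(n, 20.0, 3.0), "seconds")
```

A short chat answer comes back in well under a minute at that rate, while an agentic turn that emits thousands of tokens takes minutes even before prompt processing on a large context is counted.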
The Mac Studio route is not "AI on CPU": the M2/M4 are complex SoCs that include a GPU with unified memory access.
I think you could put together a bunch of Apple silicon Macs with enough RAM, e.g. in an office or coworking space. 800-1000 GB of RAM in total, perhaps?
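A quick sketch of how many machines that would take. The 800-1000 GB targets are from the comment; the per-machine RAM sizes (64, 128 and 512 GB) are assumptions for illustration, the helper name is made up, and this counts raw RAM only, not what is actually usable for the model or the cost of linking the machines.

```python
import math


def machines_needed(target_gb: float, ram_per_machine_gb: float) -> int:
    """Smallest number of machines whose combined raw RAM reaches the target."""
    return math.ceil(target_gb / ram_per_machine_gb)


if __name__ == "__main__":
    # Targets from the comment; per-machine RAM sizes are assumed.
    for ram_gb in (64, 128, 512):
        low = machines_needed(800, ram_gb)
        high = machines_needed(1000, ram_gb)
        print(f"{ram_gb} GB machines: {low} to {high} needed")
```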