Comment by zmmmmm

10 hours ago

Curious what would be the most minimal reasonable hardware one would need to deploy this locally?

I read "reasonable" as meaning reasonable speed to actually use this as intended (in agentic setups). In that case, you're looking at a minimum of $70-100k for hardware (8x 6000 PRO plus all the other pieces to make it work). The model ships with a native INT4 quant, so ~600GB for the weights alone; an 8x 96GB setup leaves ~160GB for KV caching.
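
The arithmetic, roughly (a back-of-the-envelope sketch; the ~1.2T parameter count is an assumption inferred from the ~600GB INT4 figure):

```python
# Back-of-the-envelope VRAM budget for an 8x 96GB setup.
# Assumes a ~1.2T-parameter model stored as native INT4 (4 bits = 0.5 bytes/param);
# the parameter count is inferred from the ~600GB weight figure, not confirmed.
params = 1.2e12
weight_bytes = params * 0.5            # INT4: half a byte per parameter
weight_gb = weight_bytes / 1e9         # ~600 GB for the weights alone

total_gb = 8 * 96                      # 768 GB across eight 96GB cards
kv_headroom_gb = total_gb - weight_gb  # ~168 GB left for KV cache, activations, etc.

print(f"weights: ~{weight_gb:.0f} GB, headroom: ~{kv_headroom_gb:.0f} GB")
```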

You can of course "run" this on cheaper hardware, but the speeds will not be suitable for real use (i.e. minutes for a simple prompt, and tens of minutes per turn in high-context sessions).

Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each, so $20,000 for the pair.
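
For a single machine, the MLX workflow is only a few lines (a minimal sketch assuming the mlx-lm package and a quantized model that fits in unified memory; the model id below is a placeholder, and splitting a model across two Studios needs additional distributed-inference tooling not shown here):

```python
# Minimal MLX text generation on Apple silicon unified memory.
# Assumes `pip install mlx-lm` and a quantized model small enough to fit in RAM.
from mlx_lm import load, generate

# Hypothetical model id, used here only for illustration.
model, tokenizer = load("some-org/some-model-4bit")
text = generate(model, tokenizer, prompt="1+1=", max_tokens=32)
print(text)
```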

  • You might want to clarify that this is more of a "look, it technically works" setup, not an "I actually use this" one. The difference between waiting 20 minutes for it to answer the prompt '1+1=' and actually using it for something useful is massive here.

    I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although I did do this for a 7B model.)

    • MLX uses the GPU, not the CPU.

      That said, I wouldn't necessarily recommend spending $20,000 on a pair of Mac Studios to run models like this. The performance won't be nearly as good as the server-class GPU hardware that hosted models run on.

    • The reason Macs get recommended is the unified memory, which is usable as VRAM by the GPU. People are similarly using the AMD Strix Halo for AI, which has a comparable memory architecture. Time to first token for something like '1+1=' would be seconds, and after that you'd get ~20 tokens per second (a rough bandwidth-based estimate is sketched after this thread), which is plenty fast for regular use. Tokens/s drops at the high end of the context window, but it's still practical for a lot of use cases. Though I agree that agentic coding, especially over large projects, would likely get too slow to be practical.

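On the ~20 tokens per second point: decode at batch size 1 is mostly memory-bandwidth-bound, so a useful ceiling is memory bandwidth divided by the weight bytes read per token. A rough sketch (the ~800 GB/s M3 Ultra bandwidth and ~30B active INT4 parameters for a large MoE model are assumptions):

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound MoE model.
# Assumptions: M3 Ultra ~800 GB/s memory bandwidth; ~30B parameters
# active per token (MoE, not the full parameter count) at INT4
# (0.5 bytes/param). Real throughput lands well below this ceiling.
bandwidth_bps = 800e9                  # bytes/s, approximate M3 Ultra figure
active_params = 30e9                   # params actually read per decoded token
bytes_per_token = active_params * 0.5  # INT4 weight bytes touched per token

ceiling_tps = bandwidth_bps / bytes_per_token
print(f"theoretical ceiling: ~{ceiling_tps:.0f} tokens/s")  # ~53 tok/s
```

A practical ~20 tok/s sits comfortably under that ~53 tok/s ceiling, which is consistent with the claim above.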

I think you could put together a bunch of Apple silicon Macs with enough RAM, e.g. in an office or coworking space. 800-1000 GB of RAM, perhaps?