Comment by simonw
12 days ago
I follow the MLX team on Twitter and they sometimes post about using MLX on two or more joined together Macs to run models that need more than 512GB of RAM.
A couple of examples:
Kimi K2 Thinking (1 trillion parameters): https://x.com/awnihannun/status/1986601104130646266
DeepSeek R1 (671B): https://x.com/awnihannun/status/1881915166922863045 - that one came with setup instructions in a Gist: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba...
For a bit more context, those posts are using pipeline parallelism. For N machines put the first L/N layers on machine 1, next L/N layers on machine 2, etc. With pipeline parallelism you don't get a speedup over one machine - it just buys you the ability to use larger models than you can fit on a single machine.
The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication.
> The main challenge is latency since you have to do much more frequent communication.
Earlier this year I experimented with building a cluster to do tensor parallelism across large cache CPUs (AMD EPYC 7773X have 768mb of L3). My thought was to keep an entire model in SRAM and take advantage of the crazy memory bandwidth between CPU cores and their cache, and use Infiniband between nodes for the scatter/gather operations.
Turns out the sum of intra-core latency and PCIe latency absolutely dominate. The Infiniband fabric is damn fast once you get data to it, but getting it there quickly is a struggle. CXL would help but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.
That's how Groq works. A cluster of LPUv2s would probably be faster and cheaper than an Infiniband cluster of Epycs.
3 replies →
Exo-Labs is an open source project that allows this too, pipeline parallelism I mean not the latter, and it's device agnostic meaning you can daisy-chain anything you have that has memory and the implementation will intelligently shard model layers across them, though its slow but scales linearly with concurrent requests.
Exo-Labs: https://github.com/exo-explore/exo
But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do KV lookup on shards, not sure how much speed-up that will be though).
No you use tensor parallelism in both cases.
The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and are processed independently. Attention, rope norm etc is run on the node-specific output of that. Then, when the output linear layer is applied an "all reduce" is computed which combines the output of all the nodes.
EDIT: just realized it wasn't clear -- this means that each node ends up holding a portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA where there are fewer KV heads than ranks you end up having to do some replication etc)
3 replies →
Even if it wasn't outright beneficial for decoding by itself, it would still allow you to connect a second machine running a smaller, more heavily quantized version of the model for speculative decoding which can net you >4x without quality loss
Tensor Parallel test with RDMA last week https://x.com/anemll/status/1996349871260107102
Note fast sync workaround
I’m hoping this isn’t as attractive as it sounds for non-hobbyists because the performance won’t scale well to parallel workloads or even context processing, where parallelism can be better used.
Hopefully this makes it really nice for people that want the experiment with LLMs and have a local model but means well funded companies won’t have any reason to grab them all vs GPUs.
No way buying a bunch of minis could be as efficient as much denser GPU racks. You have to consider all the logistics and power draw, and high end nVidia stuff and probably even AMD stuff is faster than M series GPUs.
What this does offer is a good alternative to GPUs for smaller scale use and research. At small scale it’s probably competitive.
Apple wants to dominate the pro and serious amateur niches. Feels like they’re realizing that local LLMs and AI research is part of that, is the kind of thing end users would want big machines to do.
Exactly: The AI appliance market. A new kind of home or small-business server.
10 replies →
Power draw? A entire Mac Pro running flat out uses less power than 1 5090. If you have a workload that needs a huge memory footprint then the tco of the Macs, even with their markup may be lower.
I haven’t looked yet but I might be a candidate for something like this, maybe. I’m RAM constrained and, to a lesser extent, CPU constrained. It would be nice to offload some of that. That said, I don’t think I would buy a cluster of Macs for that. I’d probably buy a machine that can take a GPU.
I’m not particularly interested in training models, but it would be nice to have eGPUs again. When Apple Silicon came out, support for them dried up. I sold my old BlackMagic eGPU.
That said, the need for them also faded. The new chips have performance every bit as good as the eGPU-enhanced Intel chips.
4 replies →
I think it’s going to be great for smaller shops that want on premise private cloud. I’m hoping this will be a win for in-memory analytics on macOS.
The lack of official Linux/BSD support is enough to make it DOA for any serious large-scale deployment. Until Apple figures out what they're doing on that front, you've got nothing to worry about.
Why? AWS manages to do it (https://aws.amazon.com/ec2/instance-types/mac/). Smaller companies too - https://macstadium.com
Having used both professionally, once you understand how to drive Apple's MDM, Mac OS is as easy to sysadmin as Linux. I'll grant you it's a steep learning curve, but so is Linux/BSD if you're coming at it fresh.
In certain ways it's easier - if you buy a device through Apple Business you can have it so that you (or someone working in a remote location) can take it out of the shrink wrap, connect it to the internet, and get a configured and managed device automatically. No PXE boot, no disk imaging, no having it shipped to you to configure and ship out again. If you've done it properly the user can't interrupt/corrupt the process.
The only thing they're really missing is an iLo, I can imagine how AWS solved that, but I'd love to know.
3 replies →
Not sure I understand, Mac OS is BSD based. https://en.wikipedia.org/wiki/Darwin_(operating_system)
6 replies →
Almost the most impressive thing about that is the power consumption. ~50 watts for both of them? Am I reading it wrong?
Yeah, two Mac Studios is going to be ~400 W.
What am I missing? https://i.imgur.com/YpcnlCH.png
(Edit: interesting, thanks. So the underlying OS APIs that supply the power-consumption figures reported by asitop are just outright broken. The discrepancy is far too large to chalk up to static power losses or die-specific calibration factors that the video talks about.)
1 reply →
Can confirm. My M3 Ultra tops out at 210W when ComfyUI or ollama is running flat out. Confirmed via smart plug.