Comment by iagooar

1 day ago

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

I'm surprised no one has else has mentioned - low power mode.

With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.

If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.

I almost always keep my laptop on low power mode.

  • Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.

    • > Wish there was a way to enable low power on a per-app basis.

      Since you can control the low power mode setting from the command line: `sudo pmset -a lowpowermode 1`.

      It should be pretty straightforward to hook this up to Hammerspoon[1] using hs.application.frontmostApplication() to apply the setting based on whatever foreground application you choose.

      Thinking out loud, that being said, the necessity of sudo might make this slightly more complex. An always on background admin agent might be needed I suppose to bypass the password prompts (or add pmset to the sudoers file, if you prefer).

      [1]: https://www.hammerspoon.org/

      1 reply →

  • Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).

  • Will give this a try later. Enjoy working with A3B Coder, but the heat coming out my 32gb M5 is a lot. This might be the trick - Thanks!

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

  • On paper the M4 should be roughly 1/3 of the M5, in practice it is only 1/2. With the right, optimized model like qwen3.6 35B MoE MLX you can get over 40 tok / sec on it. I run dozens of background jobs that are not time-critical on it.

  • > When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

    This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

    • Normally people refer to the compute-bound phase as "prefill". Nothing wrong with saying it's building the kv cache though, it's accurate just unusual.

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.

It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.

All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

  • Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

    • I have a M1 Macbook Pro...with only 16gb and I struggled with Qwens2.5-14b trying to do large projects. I loved Qwen but I had to try and do something different. So I switched to Gemma4-12b which looking at it now, seems more like a downgrade than an upgrade.Can you refer me to any Qwen coding models that wont choke my poor 16gb and also connect contextually? I need that context. I love the laser point focus, but I need context and basic understanding of that context.

    • I haven't run a proper eval, but I've been getting better luck with Qwen models than Gemma on plant and animal identification using vision.

      I do like Gemma for translation, however.

  • You can limit TDP on Strix Halo so it runs between 32 and 45W which seems to be the sweet spot for heat vs speed.

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

  • Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.

    Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

  • I think more bought them to run their Clawed on it but still with external LLM calls.

    There should be a lot more content on setups and best practices etc. if these macs would be used with local models only.

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.

I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).

Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

Yikes! I've been needing an upgrade, and I was on the fence between a specc'd out MBP, or building out a AI server and delegating tasks to it over Netbird/Tailscale to my homelab.

I'm mainly interested in coding/image creation tasks. Has anyone built out a server for a similar use-case and, if so, whats your experience been? What cards should I be looking into? Am I looking at spending ~10-15k for something that can give me near frontier quality/speed? I know about the DGX Spark/Mac Mini's, but I'd like to be able to upgrade later down the road.

I have that model, and do local LLMs and local image generation. DO buy this if you plan on serious local LLM use and enjoy working from anywhere.

Don't expect workstation loads with no fan or heatsink, true. But it's not a real problem, it's still quieter than a desktop.

That said, rather than Mac Mini, if you only work from one place, I'd recommend a Studio Ultra M3 with 512GB. Same or more tokens per second, multiple models loaded. Cool and quiet.

Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB.

If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.

  • The Mac mini was available with 64GB of RAM literally 4 days ago; the option was discontinued on June 25th.

  • DGX Spark everyone is saying performance for the money is not there

    • I have an access to a DGX spark, and while it performs better than my MacBook Pro (M3 Max), the performance on Qwen and Gemma dense models is dog shit, and not worth it.

      2 replies →

  • I'm using a 64GB M4 Mac Mini.

    They pulled them a month or two ago, right after I bought it.

  • That's incorrect, I have one on my desk right now. They've stopped selling it now, but I got one a year and a half ago:

    > Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00

I think there is no reasonably priced machine you could run locally to do serious work with LLMs...

10x rtx6000 Pro in a large workstation is probably the way to go for someone wanting to run GLM5.2.

Other than that it is cloud.

As good as these small models got we are still not "at breakeven" for me.

What is "breakeven" with LLMs? For me it is when I no longer have to read the actual code it wrote. I can trust that if I told it to implement and document a certain architecture it actually did that with no stupid mistakes.

The first model ever that did that for me was the first opus. 4.4 if I remember correctly.

The second model was Gemini 3 Pro preview. For few weeks. Then it was lobotomised. I guess it was too expensive to run and they quantized it too hell.

Only Opus remains. If this GLM model truly rivals even an old opus I'll be very happy when day comes that I'll be able to run it locally.

I am using MacBook Pro M4 with 64GB of RAM and I have it on direct path of air conditioning airflow, 40ish cm from the device, while running LM Studio opened to network. No noise, not hot to the touch.

Using linux for actual work on my workstation.

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

  • My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill.

    $6800 is a lot of API credits for GLM, for example, on any provider you want to use.

    Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

    I still am going to buy a second one haha

  • My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.

    • Thoughts on a M5 Ultra 768GB if it drops? What's the price to make it worth it for you over a spark cluster?

      I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.

      2 replies →

  • I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

    But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

    Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

    • Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

      I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

      As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

      Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

    • There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).

    • I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

    • `llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

  • Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

  • I tried to run it on a M4 Air for shits and giggles.

    After about 1 minute the entire machine basically bricked and I had to hard reset :D

running potentially sota open-weight models locally only became a thing in fall 2023.

if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.

more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.

that`s also how i would interpret the recent rumors on m6 and m7.

naturally, the cooling and all that will be optimized around that.

so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.

you are basically paying the price now to be on the bleeding (sweating) edge

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

  • DwarfStar is the only thing I've run that doesn't try and make my Mac Studio 128GB take off. Yes, it gets hot while doing inference but quickly cools down when idling, something I haven't experienced with Ollama, LMStudio or OMLX.

This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.

  • I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks.

    https://github.com/tedivm/qwen36-27b-docker

    • How useful is the second 3090 in this setup? I run the 5-bit quantized model on a single 3090. Does the second 3090 allow you to use the full precision model instead or a less aggressive quantization by splitting the layers? What about running the 35B model instead?

    • But is Qwen3.6 27B actually worth this investment? If I had to guess you still use SOTA for architectural/planning work?

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

  • I'm still kicking myself for buying a 32GB M1 Max Studio two years ago when it wouldn't have been that difficult to get a 64GB instead.

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

  • The cheapest 3090s I could find with any sort of guarantee were pushing $1500.

    An AMD AI Pro R9700 32GB brand new is $1350 right now.

    After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.

  • That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

  • My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

    But man, I have never purchased a computer which is more expensive than a decent family car.

  • An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

    • A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.

      1 reply →

    • I'd also like to call out that "high bandwidth memory" (HBM) is a specifically defined thing[0], and is used in high end GPUs, and notably not used in Apple's machines.

      I know you probably weren't referring to this type of memory in your post, but IMO it might be worth avoiding this term in the future unless you're referring to HBM, the standard.

      [0] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

    • Yeah this is just not the case at all; a 5090 or any of the recent nvidia workstation cards all fit this criteria.

      Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

So the sweet spot for dev in 2026 is 64k context windows? Are we back in 2024?

As more context will degrade a lot the t/s. On top this is 1 slot.

If you use sub agents the kv cache will be invalidated with colliding request and make it even slower.

So the in real world 256k (the max qwen offer) and using 3-4 slots the numbers are very different.

This is the major issue with so many postes over local models not benchmarking real world use. Real context and not taking this in context.

If you use 1 slot the issue, you loose the ability of using sub agents when exploring and all end up in the main agent context overloading it, triggering compactation and oh boy with 64k context that compecation will be an endless loop.

What tasks you would really be able to do with 64k context 1 agent? For sure so quick edits but not complex planning where you need to ingest a lot files and end up loosing 80% of the ingested files to compactation.

No laptop is thermally designed to handle sustained high workloads. The whole point of a laptop is to keep it thin, quiet and light, the exact opposite of what cooling needs.

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

  • There is no MacBook Pro with OLED (yet).

    • My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.

You can use a fan app to ramp up how fast the fans spin instead of the default so you can prevent any throttling

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

  • "macOS" (or however they spell it now) is pretty bad, but I'm not sure it's possible Apple could ever possibly produce an OS as bad as Windows 11 lol, it's really surprising to me to see someone suggest it's somehow actually worse?! How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update? I know multiple people personally who have experienced this with Windows 10/11, not once with a Mac. Just that alone is like the end of the argument for me, ignoring all the shockingly brutal UI problems.

    • >How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update

      I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.

      2 replies →

I just checked apple's website and configured them:

Mac Studio: Ships: 16–18 weeks

Mac mini: Ships: 10–12 weeks

Yes, it gets really hot really fast.

As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.

Thank you - I was very close but thanks to chores and availability haven't pulled the trigger. You are very convincing.

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

  • > you do need a fast LAN connection

    Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.

  • I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

You can get some work done by using low power mode even when plugged in, and making your fan start running when the temps just start to rise (maybe 40 degrees. Use a third party fan app to set it up

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

  • They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

    I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

    • You could fit a Q4 GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2

      But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.

      1 reply →

    • Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too

    • 10k is rather a lot yes. For LLMs you can use a lot of tokens with 10k with less hassle without the machine (and also it's not like electricity is free), but for some other things like video models 10k would get burned very fast. I am looking for something more in the 5k range though.

Can you define "serious programming"? Because I use it to implement things I COULD go and figure out like algorithms or test generation or evaluations etc, the "serious" programming I tend to do myself. That is what I'm paid for.

  • Serious programming is dealing with a large knowledge surface area.

    So not "implement me a shading algorithm"

    But more like: make an multi user app running on a k8 cluster, design the whole thing to be indempotent, scalable, easy to deploy remotely via ipmi/pxe boot.

    Then see how it makes stupid mistakes along the way.

    Today's AI is pretty amazing when it comes to fixing narrow problems (or creating Web apps with no infra). Give it anything where it needs to go online, download some helm templates and look through them to figure out parameters, as well as write an app and it will make lots of mistakes in seemingly simple stuff.

    Opus seems to be the model that works the best with this.

  • Serious programming is using as many agents and loops as possible because anthropic needs you to spend more on tokens

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

  • M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

    • You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.

      2 replies →

>Sure you can use it in clamshell mode

Wouldn't this damage the MBP display?

My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.

I completely disagree, it is probably the best platform currently for this - and the way I run it is as a server with tailscale accessible from my coding machine (same as you suggest here) - the difference is that you can stop the server, use it as a video editing rig on a whim, or use it for training instead of inference (yes PyTorch has caught up and Metal is a great platform for this now).

It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!

TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.

Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.

  • Is there something wrong with the m5s? I have an m4 pro and I’ve never heard the fan on it. I don’t do much with local llms, but I naturally use the web and play games (windows games at that with wine/crossover).

  • That seems very unusual for modern Apple Silicon. Our family has:

    - M3 Pro MacBook Pro 36GB

    - M2 Pro MacBook Pro 16GB

    - Mac Studio M4 Max 48GB

    and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.

    You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.

  • As someone who just upgraded a month ago from the last Intel MBP to a new base M5 MBP, I think your laptop might have a problem. I'm definitely not experiencing any of what you describe when doing normal tasks.

Your MacBook will not last running current big LLMs on these hardware. The heat will wear on it.

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

This -- with the M5 Max MBP is running flat out, you'll go from full battery to empty in under two hours.

While it is wild to have this much power in a take-it-anywhere laptop form factor, I sort of regret not just going for a Mac Studio + base M5 MBP.

Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.

  • Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

What kind of speed in tk/s do you get with the MacBook?

  • qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.

    qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

This is a very exaggerated take. I have an Apple M5 Max with 128 GB ram running 15'ish Coasts (coasts.dev) environments, each of them running postgress, python, redis and FE stack + locally running voice models and face swap models .. and the only time the fan kicks in is when I open multiple google analytics tabs.

Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.

Very surprised an Apple device can have some atrocious ventilation design.

I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.

why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?

to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.

  • It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.

>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement

Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.

A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.

Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.

Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.

  • I am not going to flag you, I am much OK with having good arguments.

    I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.

    I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

    I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.

    But you do you.

    • I have to see I'm in the rtx camp. A dual rtx3090 workstation with 200G of ram and zen5 9950x cpu. All watercooled.

      The only reason I can tell it's on, is the very quiet hum of the slow speed water pump. Large fans run at 1200rpm and are fully quiet.

      I have over a meter of radiators there.

      Fun fact, I bought my first rtx3090 4 years ago. A year ago I bought another one and they are still the same price used.

      I may buy another one (for my servers)

    • If you are in Apple ecosystem, and have reasons to own one besides inference, then buying a used Mac mini pro isn’t such a bad idea. I just bought a regular Mac mini just to provide a nice front end to my Ubuntu workstation. But if all you want is inference, then a cheap PC with a 32gb 9700 (or two!) in it is far cheaper. This specific thread was about someone who already has a MacBook. A cheap PC and GPU pairs well. Or a spark: slower but more memory. Or fuck it! Get a 5090 or a 6000!

    • >You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

      If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.

      But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.