For chat type interactions prefill is cached, prompt is processed at 400tk/s and generation is 100-107tk/s, it's quite snappy. Sure, for 130,000 tokens, processing documents it drops to, I think 60tk/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too.
I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models in a m4 max and it's averaged over many tokens, I was just getting 127tok/s for 700 token response on a 24b MoE model yesterday. I tend to use Qwen 3 Coder Next the most which is closer to 65 or 70 tok/s, but absolutely usable for dev work.
I think the truth is somewhere in the middle, many people don't realize just how performant (especially with MLX) some of these models have become on Mac hardware, and just how powerful the shared memory architecture they've built is, but also there is a lot of hype and misinformation on performance when compared to dedicated GPU's. It's a tradeoff between available memory and performance, but often it makes sense.
LM Studio (which prioritizes MLX models if you're on Mac and they are available) - I have it setup with tailscale running as a server on my personal laptop. So when I'm working I can connect to it from my work laptop, from wherever I might be, and it's integrated through the Zed editor using its built in agent - it's pretty seamless. Then whenever I want to use my personal laptop I just unload the model and do other things. It's a really nice setup, definitely happy I got the 128gb mbp because I do a lot of video editing and 3d rendering work as a hobby/for fun and it's sorta dual purpose in that way, I can take advantage of the compute power when I'm not actually on the machine by setting it up as a LLM server.
For chat type interactions prefill is cached, prompt is processed at 400tk/s and generation is 100-107tk/s, it's quite snappy. Sure, for 130,000 tokens, processing documents it drops to, I think 60tk/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too.
Please read the guidelines and consider moderating your tone. Hostility towards other commenters is strongly discouraged.
I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models in a m4 max and it's averaged over many tokens, I was just getting 127tok/s for 700 token response on a 24b MoE model yesterday. I tend to use Qwen 3 Coder Next the most which is closer to 65 or 70 tok/s, but absolutely usable for dev work.
I think the truth is somewhere in the middle, many people don't realize just how performant (especially with MLX) some of these models have become on Mac hardware, and just how powerful the shared memory architecture they've built is, but also there is a lot of hype and misinformation on performance when compared to dedicated GPU's. It's a tradeoff between available memory and performance, but often it makes sense.
what inference runtime are you using? You mentioned mlx but I didn't think anyone was using that for local llms
LM Studio (which prioritizes MLX models if you're on Mac and they are available) - I have it setup with tailscale running as a server on my personal laptop. So when I'm working I can connect to it from my work laptop, from wherever I might be, and it's integrated through the Zed editor using its built in agent - it's pretty seamless. Then whenever I want to use my personal laptop I just unload the model and do other things. It's a really nice setup, definitely happy I got the 128gb mbp because I do a lot of video editing and 3d rendering work as a hobby/for fun and it's sorta dual purpose in that way, I can take advantage of the compute power when I'm not actually on the machine by setting it up as a LLM server.
LM Studio has had an MLX engine and models since 2024.