Comment by refulgentis
5 days ago
I'm so darn confused about local LLMs and M-series inference speed: the perf jump from M2 Max to M4 Max was negligible, 10-20%. (Both times a MacBook Pro with 64 GB and the max GPU core count.)
Does your inference framework target the NPU or just GPU/CPU?
It's linking llama.cpp and using Metal, so I presume GPU/CPU only.
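For context, here's a rough sketch of how GPU offload is typically controlled when llama.cpp is driven from Python via the llama-cpp-python bindings. This is just an illustration of the knob in question, not this project's actual integration, and the model path is made up.

```python
# Minimal sketch (hypothetical model path, llama-cpp-python bindings assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 offloads all layers to the Metal (GPU) backend on Apple silicon
    n_ctx=4096,
)

out = llm("Explain the Metal backend in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Either way, none of this touches the NPU (ANE); Metal only gets you the GPU.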
I'm more than a bit overwhelmed with what's on my plate and have completely missed the boat on, e.g., understanding what MLX actually is. I'd be really curious for a thought dump if you have some opinionated experience/thoughts here. (E.g., it never crossed my mind until now that you might get better results on the NPU than the GPU.)
LM Studio seems to have MLX support on Apple silicon, so you could quickly get a feel for whether it helps in your case: https://github.com/lmstudio-ai/mlx-engine
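If you want a quick comparison without going through LM Studio, something like this with the mlx-lm package (which, as far as I know, mlx-engine builds on) should give you a tokens/sec number to hold against llama.cpp. The model name is just an example from the mlx-community Hugging Face org; pick whatever quant you already use.

```python
# Quick MLX sanity check (example model name, mlx-lm package assumed installed).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain MLX in one sentence.",
    max_tokens=64,
    verbose=True,  # prints generation stats, including tokens/sec
)
print(text)
```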