Comment by mtw

10 months ago

how much RAM do you have? I want to compare with my local setup (M4 Pro)

14 comments

mtw

I have a MBP M1 Max 64GB and I get 40t/s with llama.cpp and unsloth q4_k_m on the 30B A3B model. I always use /nothink and Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 - these are the settings recommended for Qwen3 and they make a big difference. With the default settings from llama-server it will always run into an endless loop.

The quality of the output is decent, just keep in mind it is only a 30B model. It also translates really well from french to german and vice versa, much better than Google translate.

Edit: for comparision, Qwen2.5-coder 32B q4 is around 12-14t/s on this M1 which is too slow for me. I usually used the Qwen2.5-coder 17B at around 30t/s for simple tasks. Qwen3 30B is imho better and faster.

[1] parameters for Qwen3: https://huggingface.co/Qwen/Qwen3-30B-A3B

[2] unsloth quant: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF

[3] llama.cpp: https://github.com/ggml-org/llama.cpp

omneity 10 months ago

128GB but it's not using much.

I'm running Q4 and it's taking 17.94 GB VRAM with 4k context window, 20GB with 32k tokens.

A4ET8a8uTh0_v2 10 months ago
I am not a mac person, but I am debating buying one for the unified ram now that the prices seem to be inching down. Is it painful to set up? The general responses I seem to get range from "It is takes zero effort" to "It was a major hassle to set everything up."
- simonw 10 months ago
  
  LM Studio and Ollama are both very low complexity ways to get local LLMs running on a Mac.
  As a Python person I've found uv + MLX to be pretty painless on a Mac too.
- dghlsakjg 10 months ago
  
  Read the article you are commenting on. It is a how to that answers your exact question. It takes 4 commands in the terminal.
- PhilippGille 10 months ago
  
  > I am not a mac person, but I am debating buying one for the unified ram
  Soon some AMD Ryzen AI Max PCs will be available, with unified memory as well. For example the Framework Desktop with up to 128 GB, shared with the iGPU:
  - Product: https://frame.work/us/en/desktop?tab=overview
  - Video, discussing 70B LLMs at around 3m:50s : https://youtu.be/zI6ZQls54Ms
  
  1 reply →
- MR4D 10 months ago
  
  You can use the method in this tutorial or you can download LM Studio and run it.
  The latter is super easy. Just download the model (thru the GUI) and go.
- bloqs 10 months ago
  
  The article should answer your question. Or do you mean setting up a Mac for use as a Linux or windows user
  
  3 replies →
- avetiszakharyan 10 months ago
  
  Honestly, it is quite a hastle, took me 2 hours BUT. if you just take the whole article text and paste that to gemini-2.5-pro and give your circumstance, i think it will give you specific steps for your case and it should be trivial from that moment on
- emmelaich 10 months ago
  
  Using llama.cpp or pytorch could hardly be easier.