
Comment by rahimnathwani

8 hours ago

For local models, are you wanting to do:

A) Embeddings.

B) Things like classification, structured outputs, image labelling, etc.

C) Image generation.

D) LLM chatbot for answering questions, improving email drafts, etc.

E) Agentic coding.

?

I have an MBP with an M1 Max and 32GB RAM. I can run a 20GB mlx_vlm model like mlx-community/Qwen3.5-35B-A3B-4bit. But:

- it's not very fast

- the context window is small

- it's not useful for agentic coding

I asked "What was mary j blige's first album?" and it generated 332 tokens (mostly reasoning) before arriving at the correct answer.

mlx_vlm reported:

  Prompt: 20 tokens @ 28.5 t/s | Generation: 332 tokens @ 56.0 t/s | Peak memory: 21.67 GB
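For anyone wanting to reproduce this: below is roughly what the invocation looks like, going by mlx_vlm's README (load/generate/apply_chat_template). Treat it as a sketch, not gospel; the API has shifted between versions, and I'm assuming num_images=0 is accepted for a text-only prompt.

  from mlx_vlm import load, generate
  from mlx_vlm.prompt_utils import apply_chat_template
  from mlx_vlm.utils import load_config

  model_path = "mlx-community/Qwen3.5-35B-A3B-4bit"
  model, processor = load(model_path)   # pulls weights from Hugging Face on first run
  config = load_config(model_path)

  # Text-only prompt, wrapped in the model's chat template (no images attached)
  prompt = apply_chat_template(processor, config,
                               "What was mary j blige's first album?",
                               num_images=0)

  # verbose=True prints the prompt/generation t/s and peak-memory line quoted above
  output = generate(model, processor, prompt, verbose=True)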