Comment by terhechte (14 days ago)

To add, they say about the 400B "Maverick" model:

> while achieving comparable results to the new DeepSeek v3 on reasoning and coding

If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.

The upside being that it can be used on codebases without having to share any code with an LLM provider.

Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree (and I contribute to llama.cpp); if anything, that estimate is quite generous.

[^1]: https://news.ycombinator.com/item?id=43595888
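To make that expectation concrete, here's a quick back-of-the-envelope sketch of what ~1 minute per 10K tokens looks like at typical coding-context sizes. The throughput figure is the one from the linked post; the context sizes and the linear-scaling assumption are just illustrative:

```python
# Rough prefill-time estimate from the ~1 minute per 10K tokens figure above.
# Assumes prompt processing scales linearly with context size, which the
# replies below suggest is only approximately true.
PREFILL_TOKENS_PER_SEC = 10_000 / 60  # ~167 tokens/s

def prefill_seconds(context_tokens: int) -> float:
    """Estimated prompt-processing time before the first generated token."""
    return context_tokens / PREFILL_TOKENS_PER_SEC

# Illustrative context sizes for a coding-assistant session (hypothetical).
for ctx in (2_000, 10_000, 32_000, 120_000):
    print(f"{ctx:>7} tokens -> ~{prefill_seconds(ctx) / 60:.1f} min to first token")
```

At 32K tokens of code context, that already works out to roughly three minutes of waiting before the first generated token.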

  • I don't think the time grows linearly. The more context, the slower it gets (at least in my experience, because the system has to throttle). I just tried 2K tokens on the same model I used for the 120K test some weeks ago, and processing took 12 seconds to first token (Qwen 2.5 32B Q8).

    • Hmmm, I might be rounding off wrong? Or reading it wrong?

      IIUC the data we have:

      2K tokens / 12 seconds = 166 tokens/s prefill

      120K tokens / (10 minutes == 600 seconds) = 200 tokens/s prefill

    • > The more context the slower

      It seems the other way around?

      120k : 2k = 60 : 1, but 600 s : 12 s = only 50 : 1, so per token the 120k run was actually a bit faster (see the sketch below the thread).
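For what it's worth, putting the two measurements side by side makes the comparison explicit (numbers as quoted above, rounded):

```python
# Prefill throughput from the two measurements quoted above
# (Qwen 2.5 32B Q8): 2K tokens in 12 s, 120K tokens in ~10 minutes.
measurements = {2_000: 12.0, 120_000: 600.0}  # context tokens -> seconds to first token

for tokens, seconds in measurements.items():
    rate = tokens / seconds    # prefill tokens/s
    per_2k = 2_000 / rate      # seconds per 2K tokens at this rate
    print(f"{tokens:>7} tokens: {rate:4.0f} tokens/s prefill, ~{per_2k:.0f} s per 2K tokens")
```

By this reading, per-token prefill was slightly faster on the long prompt rather than slower, though two measurements taken weeks apart are thin evidence either way.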