← Back to context

Comment by tarruda

18 hours ago

There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673

Ohhhh geee!!! I just applied the patch to my local git copy. You need to use the model on the PR that he submitted, the model is particular because it has extra information that allows the MTP to happen. I have two amd gpus, and qwen3.6 27B qk6 does around 20t/s generation... If I run it only on one I get like 35t/s.

But with this patch I saw 46t/s with qwen3.6 27B q8... this is insane, it's 250% faster than the original speed, there was no gpu I could upgrade to get that kind of boost, amazing!

Ollama merged a PR for MTP about 2 hours ago, as well:

https://github.com/ollama/ollama/pull/15980

Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0

  • Sad:

    theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16 pulling manifest Error: pull model manifest: 412: this model requires macOS

    • What's "sad" is how slow the ollama folks are being in vendoring newer versions of ggml into their codebase. That attitude just leaves them stranded without access to newer features.