Comment by spijdar

3 days ago

Given the MTP drafter is basically a separate model, keeping it separate makes more sense IMO. It's out of my wheelhouse but it seems like you could adjust the MTP drafter model separately from the main model, too.

Ultimately, though, I think the real explanation is that Google doesn't care, since for their own purposes (in LiteRT-LM) they do bundle them. As far as I know, anyway.

5 comments

DiabloD3 3 days ago

MTP models share internal state with the main model and also refer to the main model's parameters.

They are more like a single model that has two separate attention head mechanisms.
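To make the shared-state point concrete, here is a toy sketch, not taken from any real runtime. Integers stand in for tokens and hidden states, and every name in it (`trunk`, `main_head`, `mtp_head`, `speculative`) is hypothetical. It only assumes the general shape described above: the MTP head has no trunk of its own and drafts from the hidden state the main model already computed, and the main model then verifies each draft.

```python
# Toy illustration only: integers stand in for tokens and hidden states.
# All names are hypothetical; nothing here mirrors llama.cpp or LiteRT-LM.

def trunk(tokens):
    """Shared trunk: folds the whole context into one 'hidden state'."""
    h = 0
    for t in tokens:
        h = (h * 31 + t) % 1000
    return h

def main_head(h):
    """Main model's output head: picks the next token from the trunk state."""
    return h % 7

def mtp_head(h, tok):
    """MTP head: reuses the trunk state h (the state *before* tok) to guess
    the token that follows tok. Deliberately imperfect, so some drafts fail."""
    return (h * 31 + tok) % 7

def greedy(prompt, n):
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(main_head(trunk(out)))
    return out

def speculative(prompt, n):
    """Draft-and-verify decoding; returns (tokens, main_passes)."""
    out = list(prompt)
    # Priming pass: the main head emits a token, the MTP head drafts the next.
    h = trunk(out)
    passes = 1
    nxt = main_head(h)
    draft = mtp_head(h, nxt)
    out.append(nxt)
    while len(out) - len(prompt) < n:
        # One (conceptually batched) main pass scores both positions.
        passes += 1
        h1 = trunk(out)
        v1 = main_head(h1)            # what the main model wants after nxt
        if draft == v1:
            # Draft accepted: the same pass also yields the token after it.
            h2 = trunk(out + [draft])
            v2 = main_head(h2)
            out += [draft, v2]
            draft = mtp_head(h2, v2)
        else:
            # Draft rejected: keep only the main model's own token.
            out.append(v1)
            draft = mtp_head(h1, v1)
    return out[:len(prompt) + n], passes
```

Because every draft is checked against the main head, `speculative` returns the same tokens as `greedy`; the only thing that changes is how many main passes it takes. That is also why the drafter is hard to treat as an independent model in this picture: `mtp_head` is useless without the `h` produced by `trunk`.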

girvo 3 days ago

Being grafted onto the main model reduces the layer duplication you’d otherwise have, at least for Step and Qwen 3.6.

alfiedotwtf 2 days ago

Step 2.7’s MTP seems broken (at least in ik_llama.cpp): the draft model starts and ends in block 3, but ik_llama bails out looking for block 0 :(
- girvo 2 days ago
  
  Aw, that’s a shame; I’m running the official llama.cpp on my Spark-alike, and it works great now. Proper triple head too, which is what it was trained on; it gets me up to 35-40 tk/s decode.

anaisbetts 3 days ago

I mean, just like GGUFs aren't technically necessary yet are _way_ more convenient than using Safetensors and configuring the default Jinja prompt by hand, it makes sense to bundle the draft model too. For all intents and purposes, the only people who will train a draft model are the people who train the original model.