Comment by coder543

19 hours ago

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

2 comments

coder543

a_e_k 19 hours ago

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.

coder543 19 hours ago

That's great news. That has not been the case with other MTP implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.