Comment by anonym29

5 hours ago

The problem with M2.7 is that it uses full GQA attention (no sliding-window or linear variant), so compute scales quadratically with context. It does start fast, but by 64k tokens deep, the version I'm running (Unsloth's UD IQ2_XXS) sees pp512 drop 95%, from 261.3 t/s at 0 context depth to 13.1 t/s. A q8_0 KV cache helps, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at 0 depth. TG retention is better, but still approaching single-digit t/s by 64k depth, even with the q8_0 KV cache.
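For context, here's the back-of-envelope for the drops those numbers imply (assuming the 261.3 t/s run used the default f16 KV cache; labels are mine, not from the benchmark output):

```python
# Throughput retention implied by the quoted figures:
# pp512 = prompt processing at batch 512, measured in t/s.
runs = {
    "pp512, default KV": (261.3, 13.1),  # t/s at depth 0 vs 64k
    "pp512, q8_0 KV": (258.3, 57.4),
}

for name, (d0, d64k) in runs.items():
    drop = 100 * (1 - d64k / d0)
    print(f"{name}: {drop:.0f}% drop ({d64k / d0:.1%} retained)")
```

Which works out to roughly a 95% drop (5% retained) without KV quantization and a 78% drop (22% retained) with q8_0, consistent with the numbers above.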

That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.

I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.