Comment by cold_harbor
2 hours ago
their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.
2 hours ago
their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.
Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.
That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.
[dead]