← Back to context

Comment by cold_harbor

2 hours ago

their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.