Comment by cold_harbor

2 hours ago

their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

3 comments

cold_harbor

vitorsr 3 minutes ago

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

zozbot234 2 hours ago

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

hmaddipatla 1 hour ago

[dead]