Comment by dragonwriter
9 hours ago
Well, since it's the KV cache that TurboQuant optimizes, it means a five-times-bigger context fits into RAM, all other things being equal, not a five-times-bigger model. But, sure, at any given context size with the same RAM available, you can instead fit a bigger model, which also takes more compute to deliver the same performance.
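For scale, a back-of-the-envelope sketch of that tradeoff in Python; the 70B-class shapes, the 64 GiB budget, and the ~3.2-bit width standing in for the 5x figure are all illustrative assumptions, not TurboQuant's published settings:

    # Per-token KV-cache footprint: keys + values across all layers.
    # Shapes are illustrative (roughly a 70B-class model with GQA).
    def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bits=16):
        return 2 * layers * kv_heads * head_dim * bits / 8

    budget = 64 * 2**30      # hypothetical 64 GiB reserved for the cache
    for bits in (16, 3.2):   # 16 / 3.2 = the 5x compression in question
        tokens = budget / kv_bytes_per_token(bits=bits)
        print(f"{bits:>4}-bit: ~{tokens:,.0f} tokens of context")

Same RAM, five times the context; hold context fixed instead and the reclaimed bytes can go to a larger model, as the comment says.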
Anything that increases the compute needed to fully utilize RAM bandwidth in optimal LLM serving weakens Apple's advantage there.
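That claim can be made concrete with a rough roofline-style sketch: at fixed memory bandwidth, fewer bytes per token means more tokens per second, and each token still costs (slightly more, after dequantization) compute, so the compute floor needed to stay bandwidth-bound rises. Every number below is invented for illustration, not a measurement of any Apple part or of TurboQuant:

    # Compute needed to keep the memory bus saturated during decode:
    # (tokens/s the bandwidth allows) * (FLOPs per token).
    def tflops_to_saturate(bw_gbs, gb_per_token, tflop_per_token):
        return (bw_gbs / gb_per_token) * tflop_per_token

    bw = 800.0  # GB/s, a hypothetical unified-memory bandwidth
    fp16  = tflops_to_saturate(bw, 140.0, 0.14)  # 70B fp16: ~0.8 TFLOPS suffices
    quant = tflops_to_saturate(bw, 28.0, 0.35)   # 5x fewer bytes + dequant: ~10 TFLOPS

A chip with a high bandwidth-to-compute ratio loses headroom as that floor rises, which is the weakening the comment describes.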