Comment by mercutio2
5 hours ago
MLX makes prefill effectively instantaneous on a Mac.
Storing an LRU KV cache of all your conversations, both in memory and on (plenty fast enough) SSD, and especially caching the fixed agent context every conversation starts with, takes us from "painfully slow" to "faster than using Claude" most of the time. It's kind of shocking this much perf was lying on the ground waiting to be picked up.
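The two-tier LRU idea above can be sketched roughly like this. This is a hypothetical illustration, not MLX's actual cache: real KV state is model arrays rather than picklable strings, and the class name, eviction policy details, and serialization here are all my assumptions.

```python
from collections import OrderedDict
from pathlib import Path
import pickle
import tempfile

class TieredKVCache:
    """Illustrative two-tier LRU for per-conversation KV state:
    hot entries stay in RAM; evicted entries spill to SSD instead
    of being discarded, so a reload skips prefill either way."""

    def __init__(self, max_in_memory=2, disk_dir=None):
        self.max_in_memory = max_in_memory
        self.mem = OrderedDict()  # conversation_id -> kv_state (LRU order)
        self.disk_dir = Path(disk_dir or tempfile.mkdtemp())

    def _disk_path(self, conv_id):
        return self.disk_dir / f"{conv_id}.kv"

    def put(self, conv_id, kv_state):
        self.mem[conv_id] = kv_state
        self.mem.move_to_end(conv_id)
        # Evict least-recently-used entries to SSD, not to oblivion.
        while len(self.mem) > self.max_in_memory:
            old_id, old_state = self.mem.popitem(last=False)
            self._disk_path(old_id).write_bytes(pickle.dumps(old_state))

    def get(self, conv_id):
        if conv_id in self.mem:            # RAM hit: instant resume
            self.mem.move_to_end(conv_id)
            return self.mem[conv_id]
        path = self._disk_path(conv_id)
        if path.exists():                  # SSD hit: load and promote to RAM
            state = pickle.loads(path.read_bytes())
            self.put(conv_id, state)
            return state
        return None                        # miss: caller must run prefill
```

The fixed agent preamble fits the same scheme: cache its KV state once under a well-known key, and every new conversation starts from an SSD hit at worst instead of re-prefilling thousands of tokens.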
Open models are still dumber than leading closed models, especially at editing existing code. But I use mine as an essentially free "analyze this code, look for problem <x|y|z>" tool, a job Claude is happy to do while consuming an enormous number of tokens.
But speed is no longer a problem. It's pretty awesome over here in unified memory Mac land :)