Comment by tiku

3 months ago

Personally, I'm so disappointed about the state of local AI. Only old models run "decent," but decent is way too slow to be usable.

This is exactly the problem we're trying to solve. The models themselves have gotten surprisingly capable at small sizes (Qwen3.5 4B with 262K context, LFM2 1.2B for fast tool calling), but the inference infrastructure hasn't kept up.

When people say "local AI is too slow," they usually mean the engine is too slow, not the model. A 4B model at 186 tok/s (MetalRT on M4 Max) feels genuinely responsive for interactive chat. The same model at 87 tok/s (llama.cpp) feels sluggish. Same weights, same quality, roughly 2x the speed: that's a usability cliff.
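To put those two throughput numbers in wall-clock terms, here is a rough back-of-the-envelope sketch. The 300-token reply length is an assumption picked for illustration, and the calculation ignores prompt processing and time to first token, so treat it as a lower bound on what a user would actually wait.

```python
# Rough estimate: how long a reply takes to stream at each decode rate.
# The tok/s figures are the ones quoted above; REPLY_TOKENS is an assumed
# reply length, not a measured value.

def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Decode time only; ignores prompt processing and time to first token."""
    return num_tokens / tokens_per_second

REPLY_TOKENS = 300  # assumption: a medium-length chat reply
RATES = {"MetalRT": 186.0, "llama.cpp": 87.0}  # tok/s, from the comment above

times = {name: seconds_to_generate(REPLY_TOKENS, rate) for name, rate in RATES.items()}
speedup = RATES["MetalRT"] / RATES["llama.cpp"]

for name, t in times.items():
    print(f"{name}: {t:.2f} s for {REPLY_TOKENS} tokens")
print(f"speedup: {speedup:.2f}x")
```

Under that assumption the reply finishes in well under two seconds at the higher rate and takes more than three at the lower one, which is the kind of difference people notice in interactive chat.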

We think the gap between cloud and on-device inference is an infrastructure problem, not a model problem. That's what we're working on.