1.5B models can run with CPU-only inference at around 12 tokens per second, if I remember correctly.
Ingesting multiple code files will take forever in prompt processing without a GPU, though; token generation will be the least of your worries. Especially since you're not appending to the prompt but editing it in random places, so prompt caching doesn't work.
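To illustrate why mid-file edits defeat prompt caching: a KV cache can only be reused up to the longest common prefix between the previous prompt and the new one, so an edit near the top of the context forces nearly a full re-prefill. A minimal sketch of that reuse rule (illustrative, not any particular engine's implementation):

```python
def reusable_prefix_len(old_tokens: list[int], new_tokens: list[int]) -> int:
    """Prefix-caching rule of thumb: only the longest common prefix of the old
    and new prompt can keep its KV cache; everything after the first differing
    token must be re-processed."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Appending keeps the entire old prompt as a shared prefix (cheap incremental
# prefill); editing earlier in the file shrinks the shared prefix and forces
# re-processing of everything after the edit point.
```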
A FIM or completion model like this won't have a large prompt, and caching doesn't work anyway (per their notes). It'll get maybe a few thousand tokens in a prompt, maximum. For a 1.5B model you should expect usable CPU-only inference on a modern CPU: at least hundreds of tokens per second of prefill and tens of tokens per second of generation, which is decently responsive.
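For a rough feel of what that means per completion request, here's a back-of-envelope sketch; all numbers are the ballpark figures assumed above, not benchmarks:

```python
# Back-of-envelope latency for a single FIM completion on a modern CPU.
prompt_tokens = 2000     # "a few thousand tokens in a prompt, maximum"
output_tokens = 32       # a short inline completion
prefill_tps = 500        # "hundreds of tokens per second of prefill" (assumed)
decode_tps = 20          # "tens of tokens per second of generation" (assumed)

prefill_s = prompt_tokens / prefill_tps   # ~4 s to process the prompt
decode_s = output_tokens / decode_tps     # ~1.6 s to generate the completion
print(f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s = {prefill_s + decode_s:.1f}s")
```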
Unfortunately, the main optimization (a 3x speedup) is n-gram speculative decoding, which doesn't run on CPUs. But I believe it works on Metal, at least.
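For anyone unfamiliar: n-gram speculative decoding (a.k.a. prompt-lookup decoding) drafts candidate tokens by matching the last few generated tokens against earlier text in the context and copying what followed, then has the model verify the whole draft in one batched forward pass. A rough sketch of the drafting half, with hypothetical names, not the implementation referenced above:

```python
def ngram_draft(tokens: list[int], ngram_size: int = 3, max_draft: int = 8) -> list[int]:
    """Draft tokens by matching the last `ngram_size` tokens against earlier
    context and copying what followed the most recent earlier match."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards; skip the trivial match at the very end of the sequence.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + max_draft]
    return []  # no match: fall back to ordinary one-token-at-a-time decoding
```

The target model then scores the context plus the draft in one forward pass and keeps the longest prefix of the draft matching what it would have generated anyway; that verification step trades extra parallel compute for fewer sequential steps, which is why it favors GPUs/Metal over CPUs.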
1.54GB model? You can run this on a raspberry pi.
Performance of LLM inference consists of two independent metrics: prompt processing (compute-intensive) and token generation (bandwidth-intensive). For autocomplete with a 1.5B model you can get away with abysmal 10 t/s token generation, but you want prompt processing to be as fast as possible, which a Pi is incapable of.
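To put rough numbers on those two regimes, here's a first-order estimate; the bandwidth and compute figures are illustrative assumptions for a Pi-class board, not measurements:

```python
# First-order bounds for a ~1.5B-parameter model in a ~1.54GB quantized file.
model_bytes = 1.54e9     # the whole model is streamed from RAM per generated token
params = 1.5e9
mem_bandwidth = 8e9      # ~8 GB/s effective memory bandwidth (assumption)
compute_flops = 30e9     # ~30 GFLOP/s sustained CPU throughput (assumption)

# Token generation is bandwidth-bound.
gen_tps = mem_bandwidth / model_bytes                       # ~5 t/s

# Prompt processing is compute-bound: roughly 2 * params FLOPs per prompt token.
prompt_tokens = 2000
prefill_s = (2 * params * prompt_tokens) / compute_flops    # ~200 s for a 2k-token prompt

print(f"generation ~{gen_tps:.1f} t/s, prefill of {prompt_tokens} tokens ~{prefill_s:.0f} s")
```

Under those assumptions the generation rate is merely slow, but prefilling even a modest prompt takes minutes, which is the part that kills autocomplete.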
If you mean on the new AI HAT with an NPU and integrated 8GB of memory, maybe.