Ingesting multiple code files will take forever in prompt processing without a GPU, though; tg will be the least of your worries. Especially when you're not just appending but changing things in random places, so caching doesn't work.
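Rough illustration of that last point (made-up token IDs, not tied to any particular runtime): prefix-style KV caching only reuses the tokens up to the first changed position, so an edit near the top of a file throws away almost all of the cached work.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Tokens a prefix-style KV cache can reuse: the shared leading run."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Appending at the end: everything already processed is reused.
print(reusable_prefix_len([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))  # 4

# Editing near the top of the file: almost nothing is reused,
# so nearly the whole prompt has to be re-processed.
print(reusable_prefix_len([1, 2, 3, 4], [1, 9, 3, 4, 5, 6]))  # 1
```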
A FIM or completion model like this won't have a large prompt, and caching doesn't work anyway (per their notes). It'll get maybe a few thousand tokens in a prompt, maximum. For a 1.5B model, you should expect usable CPU-only inference on a modern CPU: at least hundreds of tokens per second of prefill and tens of tokens per second of generation, which is decent in terms of responsiveness.
A thousand tokens (which would be on the low side) at 10-100 t/s ingestion speed is 10-100 seconds. I don't seriously expect anyone to wait a solid minute after pressing tab for autocomplete; regular autocomplete gets unusably annoying if it takes more than a split second, tbh.
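Back-of-envelope version of that math (the t/s figures are just the ranges being argued about in this thread, not measurements):

```python
# Rough tab-to-suggestion latency: prompt_tokens / prefill_speed,
# ignoring generation time, which is small for a short completion.
prompt_tokens = 1000

for prefill_tps in (10, 100, 300):  # speeds from the thread, not benchmarks
    latency_s = prompt_tokens / prefill_tps
    print(f"{prefill_tps:>4} t/s prefill -> {latency_s:.1f} s before the first token")

# 10 t/s -> 100 s, 100 t/s -> 10 s, 300 t/s -> ~3.3 s. Even the optimistic
# case is far above the split-second budget inline autocomplete really has.
```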
Unfortunately, the main optimization (a 3x speedup) is n-gram speculative decoding, which doesn't run on CPUs. I believe it works on Metal at least, though.
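For anyone unfamiliar, the n-gram trick is roughly prompt-lookup-style speculative decoding: draft a continuation by finding where the last few generated tokens already appear earlier in the context and copying what followed, then have the model verify the whole draft in one batched pass. A very simplified sketch of just the drafting step (not the actual implementation referenced above):

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the last n tokens against earlier context.

    Code repeats a lot (identifiers, boilerplate), so the copied continuation
    is often correct and the model only verifies it, turning several decode
    steps into one forward pass.
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Search earlier occurrences of the tail, most recent first
    # (excluding the trivial match of the tail against itself).
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            continuation = tokens[start + n : start + n + max_draft]
            if continuation:
                return continuation
    return []

# Toy example with token IDs standing in for source-code tokens:
ctx = [5, 8, 2, 7, 9, 4, 5, 8, 2]
print(ngram_draft(ctx, n=3, max_draft=3))  # [7, 9, 4], copied from after the earlier "5 8 2"
```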