Comment by vichle

16 days ago

What type of hardware do I need to run a small model like this? I don't do Apple.

1.5B models can run with CPU-only inference at around 12 tokens per second, if I remember correctly.

  • Ingesting multiple code files will take forever in prompt processing without a GPU, though; token generation (tg) will be the least of your worries. Especially since autocomplete doesn't just append but changes text in random places, so prompt caching doesn't work.

    • A FIM or completion model like this won't have a large prompt, and caching doesn't work anyway (per their notes). It'll get maybe a few thousand tokens in a prompt, maximum. For a 1.5B model you should expect workable CPU-only inference on a modern CPU: at least hundreds of tokens per second of prefill and tens of tokens per second of generation, which is responsive enough for autocomplete. (A toy FIM prompt sketch follows below.)

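      To make the "small prompt" point concrete, here's a toy fill-in-the-middle prompt builder. This is only a sketch: the special tokens follow the Qwen2.5-Coder convention, and build_fim_prompt is a made-up name; the model discussed here may expect a different layout.

          # Toy FIM prompt: the model only sees the text around the cursor,
          # so the prompt stays in the low thousands of tokens at worst.
          # Token names follow the Qwen2.5-Coder convention (an assumption,
          # not necessarily what this particular model expects).

          def build_fim_prompt(prefix: str, suffix: str) -> str:
              """The model generates the text that belongs after <|fim_middle|>."""
              return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

          prompt = build_fim_prompt(
              prefix="def mean(xs):\n    total = ",       # code before the cursor
              suffix="\n    return total / len(xs)\n",    # code after the cursor
          )
          print(prompt)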

  • Unfortunately, the main optimization (a 3x speedup) is n-gram speculative decoding, which doesn't run on CPUs. But I believe it works on Metal at least. (Rough sketch of the n-gram idea below.)
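
    For anyone curious, the n-gram trick is roughly "prompt lookup" decoding: draft tokens come from searching the existing context for the last few tokens and proposing whatever followed that match, which the model then verifies in one batched pass. A minimal sketch of the draft-proposal half (propose_ngram_draft is a made-up name; verification is omitted):

        from typing import List

        def propose_ngram_draft(context: List[int], ngram: int = 3, max_draft: int = 8) -> List[int]:
            """Draft = whatever followed the most recent earlier match of the trailing n-gram."""
            if len(context) <= ngram:
                return []
            tail = context[-ngram:]
            # Scan backwards so the most recent earlier occurrence wins.
            for start in range(len(context) - ngram - 1, -1, -1):
                if context[start:start + ngram] == tail:
                    return context[start + ngram:start + ngram + max_draft]
            return []   # no match: fall back to normal one-token-at-a-time decoding

        # Code repeats itself a lot, so the trailing n-gram usually has an earlier
        # match and several draft tokens get accepted per verification pass.
        ctx = [5, 9, 2, 7, 1, 4, 5, 9, 2]     # toy token ids
        print(propose_ngram_draft(ctx))        # -> [7, 1, 4, 5, 9, 2]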

1.54GB model? You can run this on a Raspberry Pi.

  • Performance of LLM inference consists of two independent metrics: prompt processing (compute intensive) and token generation (bandwidth intensive). For autocomplete with a 1.5B model you can get away with an abysmal 10 t/s of token generation, but you want prompt processing to be as fast as possible, and that's exactly what a Pi can't deliver. (Rough numbers sketched below.)
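
    Back-of-envelope version, assuming the usual rules of thumb: every generated token streams the whole model from memory once, and prefill costs roughly 2 * params FLOPs per prompt token. The hardware figures below are illustrative guesses, not measurements:

        MODEL_BYTES = 1.54e9      # ~1.5B params at the size mentioned upthread
        PARAMS      = 1.5e9

        # Illustrative hardware figures (guesses; substitute your own machine's numbers).
        machines = {
            "raspberry pi": (8e9,  25e9),    # ~8 GB/s memory bandwidth, ~25 GFLOPS usable
            "desktop cpu":  (60e9, 1e12),    # dual-channel DDR5, ~1 TFLOPS with wide SIMD
        }

        for name, (bandwidth, flops) in machines.items():
            tg      = bandwidth / MODEL_BYTES      # token generation: bandwidth bound
            prefill = flops / (2 * PARAMS)         # prompt processing: compute bound
            print(f"{name:12s}  ~{tg:4.0f} t/s generation, ~{prefill:4.0f} t/s prefill")

    On those rough estimates, a 2,000-token prompt takes a few seconds of prefill on a desktop CPU but minutes on a Pi, even though generation speed on the Pi would be tolerable for autocomplete.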