Comment by zozbot234
11 hours ago
The typical NPU is only marginally helpful for on-prem inference. A GPU can read quantized data from main memory and dequantize/pad it locally (making effective use of memory throughput); a NPU often needs to read padded data directly from memory, which is wasteful. So it only helps a little bit wrt. prefill.
Also, smaller models can obviously be used but a smaller model will be a lot weaker in real-world knowledge and this tends to limit their smarts in a way that can't be compensated by more thinking.
No comments yet
Contribute on Hacker News ↗