Comment by pseudosavant

2 days ago

I don't know how many others here have a Copilot+ PC, but the NPU on it is basically useless. There isn't any meaningful feature I get by having that NPU. They are far too limited to ever do any meaningful local LLM inference, image processing, or generation. It handles stuff like video chat background blurring, but users' PCs have been doing that for years now without an NPU.

I'd love to see a thorough breakdown of what these local NPUs can really do. I've had friends ask me about this (as the resident computer expert) and I really have no idea. Everything I see them advertised for (blurring, speech-to-text, etc.) is all stuff that I never felt like my non-NPU machine struggled with. Is there a single remotely killer application for local client NPUs?

  • I used to work at Intel until recently. Pat Gelsinger (the prior CEO) had made one of the top goals for 2024 the marketing of the "AI PC".

    Every quarter he would have an all company meeting, and people would get to post questions on a site, and they would pick the top voted questions to answer.

    I posted mine: "We're well into the year, and I still don't know what an AI PC is and why anyone would want it instead of a CPU+GPU combo. What is an AI PC and why should I want it?" I then pointed out that if a tech guy like me, along with all the other Intel employees I spoke to, cannot answer the basic questions, why would anyone out there want one?

    It was one of the top voted questions and got asked. He answered factually, but it still wasn't clear why anyone would want one.

  • The problem is essentially memory bandwidth, AFAIK. I'm simplifying a lot in my reply, but most NPUs (all?) do not have faster memory bandwidth than the GPU. They were originally designed when ML models were megabytes, not gigabytes. They have a small amount of very fast SRAM (4MB, I want to say?). LLM models _do not_ fit into 4MB of SRAM :).

    And LLM inference is heavily memory bandwidth bound (reading input tokens isn't, though, so it _could_ be useful for this in theory, but usually on-device prompts are very short).

    So if you are memory bandwidth bound anyway and the NPU doesn't provide any speedup on that front, it's going to be no faster. And it has loads of other gotchas on top, with no real standard "SDK" for them either.

    Note the idea isn't bad per se; it has real efficiencies once you do start getting compute bound (e.g. doing multiple parallel batches of inference at once). This is basically what TPUs do (but with far higher memory bandwidth).
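
    To make the bandwidth point concrete, here's a back-of-envelope sketch with hypothetical round numbers (not any specific chip): every generated token has to stream essentially all the model weights from RAM, so bandwidth alone caps decode speed.

    ```python
    # Decode ceiling: each generated token reads every weight once, so
    # tokens/s <= memory bandwidth / model size. Illustrative numbers,
    # not measurements of any particular device.
    model_size_gb = 8      # e.g. a ~7B-parameter model at ~8 bits/weight
    bandwidth_gbs = 120    # hypothetical shared LPDDR5X bandwidth

    print(f"decode ceiling: ~{bandwidth_gbs / model_size_gb:.0f} tokens/s")
    # ~15 tokens/s -- an NPU on the same memory bus cannot beat this,
    # however many TOPS it advertises.
    ```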

    • NPUs are still useful for LLM pre-processing and other compute-bound tasks. They will waste memory bandwidth during the LLM generation phase (even in the best-case scenario where they aren't physically bottlenecked on bandwidth to begin with, compared to the iGPU), since they generally have to read padded/dequantized data from main memory and compute directly on it, as opposed to unpacking it in local registers the way iGPUs can.

      > usually on-device prompts are very short

      Sure, but that might change with better NPU support, making time-to-first-token quicker with larger prompts.
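
      A rough sketch of why prefill, unlike decode, can become compute bound (illustrative numbers only): in prefill the weights are read once but reused across every prompt token, so arithmetic intensity grows with prompt length, which is where an NPU's TOPS could actually help time-to-first-token.

      ```python
      # Arithmetic intensity (FLOPs per byte of weights read); illustrative.
      # Roughly 2 FLOPs (multiply + add) per weight byte per token at 8-bit.
      flops_per_weight_byte_per_token = 2
      prompt_tokens = 2048

      decode_intensity = 1 * flops_per_weight_byte_per_token   # one token per pass
      prefill_intensity = prompt_tokens * flops_per_weight_byte_per_token

      print(f"decode:  ~{decode_intensity} FLOPs/byte (bandwidth bound)")
      print(f"prefill: ~{prefill_intensity} FLOPs/byte (compute bound)")
      ```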

  • In theory NPUs are a cheap, efficient alternative to the GPU for getting good speeds out of larger neural nets. In practice they're rarely used, because for simple tasks like blurring, speech-to-text, noise cancellation, etc. you can usually do it on the CPU just fine. Power users doing really hefty stuff usually have a GPU anyway, so that gets used because it's typically much faster. That's exactly what happens with my AMD AI Max 395+ board. I thought maybe the GPU and NPU could work in parallel, but memory limitations mean that's often slower than just using the GPU alone. I think I read that the intended use case for the NPU is background tasks when the GPU is already loaded, but that seems like a very niche use case.

    • If the NPU happens to use less power for any given amount of TOPS, it's still a win, since compute-heavy workloads are most often limited by power and thermals, especially on mobile hardware. That frees up headroom for the iGPU. You're right about memory limitations, but those are generally relevant to token generation, not prefill.

  • > Everything I see them advertised for (blurring, speech-to-text, etc.) is all stuff that I never felt like my non-NPU machine struggled with.

    I don’t know how good these neural engines are, but transistors are dead-cheap nowadays. That makes adding specialized hardware a valuable option, even if it doesn’t speed things up but ‘only’ decreases latency or power usage.

  • I think a lot of it is just power savings on those features, since the dedicated silicon can be a lot more energy efficient even if it's not much more powerful.

  • "WHAT IS MY PURPOSE?"

    "You multiply matrices of INT8s."

    "OH... MY... GOD"

    NPUs really just accelerate low-precision matmuls. A lot of them are based on systolic arrays, which are like a configurable pipeline through which data is "pumped", rather than a general-purpose CPU or GPU with random memory access. So they're a bit like the "synergistic" processors in the Cell, in that they accelerate some operations really quickly, provided you feed them the right way from the CPU, and even then they don't have the oomph that a good GPU will give you.
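
    The primitive itself is nothing exotic. Here's a minimal NumPy stand-in for the kind of op an NPU's fixed-function array accelerates (the real hardware does this with INT32 accumulators in a systolic grid, not general-purpose code):

    ```python
    import numpy as np

    # Low-precision matmul: INT8 inputs with INT32 accumulation (to avoid
    # overflow) -- the one operation NPUs exist to make cheap.
    A = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)
    B = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)

    C = A.astype(np.int32) @ B.astype(np.int32)  # what the systolic array computes
    print(C.dtype, C.shape)  # int32 (64, 64)
    ```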

    • My question is: Isn't this exactly what SIMD has done before? Well, or SSE2 instructions?

      To me, an NPU, as it's described, just looks like a pretty shitty and useless FPGA that any alternative FPGA from Xilinx could easily replace.

    • So it's a higher-power, DSP-style device: small transformers for data flows. Sounds good for audio and maybe tailored video-flow processing.

I have one as well and I simply don’t get it. I lucked into being able to do somewhat acceptable local LLM’ing by virtue of the Intel integrated “GPU” sharing VRAM and RAM, which I’m pretty sure wasn’t meant to be the awesome feature it turned out to be. Sure, it’s dead slow, but I can run mid size models and that’s pretty cool for an office-marketed HP convertible.

(it’s still amazing to me that I can download a 15GB blob of bytes and then that blob of bytes can be made to answer questions and write prose)

But the NPU, the thing actually marketed for doing local AI, just sits there doing nothing.

Also, the Copilot button/key is useless. It cannot be remapped to anything in Ubuntu because it sends a sequence of multiple keycodes instead of a single keycode for down and then up. You cannot remap it to a useful modifier or anything! What a waste of keyboard real estate.

  • If you want a small adventure, you could see which HID device those keystrokes show up on, and they might be remappable courtesy of showing up on a HID device for that specific button. Failing that, they most likely come from either ACPI AML code or from the embedded controller (EC). If the former, it’s not that hard to patch the AML code, and maybe Copilot could do it for you (you use standard open source tooling to disassemble the AML blob, which the kernel will happily give you, and then you make a patched version and load it). If the latter, you could see if anyone has made progress toward finding a less silly way to configure the EC.

    (The EC is a little microcontroller programmed by the OEM that does things like handling weird button presses.)

    There are also reports of people having decent results using keyd to remap the synthetic keystrokes from the Copilot button.
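
    A hypothetical keyd starting point, assuming the commonly reported leftmeta+leftshift+f23 chord (verify what your key actually emits, e.g. with `sudo keyd monitor`, since OEMs wire this differently; I haven't tested this on that specific laptop):

    ```
    # /etc/keyd/default.conf -- hypothetical, untested config;
    # check the real chord first, as the keycodes vary by OEM
    [ids]
    *

    [main]
    # many Copilot keys reportedly emit this three-key chord
    leftmeta+leftshift+f23 = f24
    ```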

    (The sheer number of times Microsoft has created totally different specs for how OEMs should implement different weird buttons is absurd.)

If I had to steelman Dell, they probably made a bet a while ago that the software side would have something for the NPU, and if so they wanted to have a device to cash in on it. The turnaround time for new hardware was probably on the order of years (I could be wrong about this).

It turned out to be an incorrect gamble but maybe it wasn’t a crazy one to make at the time.

There is also a chicken and egg problem of software being dependent on hardware, and hardware only being useful if there is software to take advantage of its features.

That said I haven’t used Windows in 10 years so I don’t have a horse in this race.

  • > There is also a chicken and egg problem of software being dependent on hardware, and hardware only being useful if there is software to take advantage of its features.

    In the 90s, as a developer you couldn't count on a user's computer having a 3D accelerator (or 3D graphics) card. So 3D video games used multiple renderers: software rendering and hardware-accelerated rendering (sometimes with different backends like Glide, OpenGL, Direct3D).

    Couldn't you simply write some "killer application" for local AI that everybody "wants", but which might be slow (even using a highly optimized CPU or GPU backend) if you don't have an NPU? Since it is a "killer application", many people will still want to run it, even if the experience is slow.

    Then, as a hardware vendor, you can make a big show of how much better the experience is with an NPU (an "AI PC"), and people will immediately want one.

    Exactly the same story as with 3D accelerators and 3D graphics cards, where Quake and Quake II were such killer applications.

  • They are still including the NPU, though; they just realised that consumers aren't making laptop purchases based on having "AI" or being branded with Copilot.

    The NPU will just become a mundane internal component that isn't marketed.

What we want as developers: to be able to implement functionality that uses a model for tasks like OCR, visual input and analysis, search or re-ranking, etc., without having to integrate an LLM API and pay for it. Instead we'd like to offer the functionality to users, possibly at no cost, and use their edge computing capacity to achieve it by calling local protocols and models.

What we want as users: to have advanced functionality without having to pay for a model or API, and without having to auth it with every app we're using. We also want to keep data on our devices.

What trainers of small models want: A way for users to get their models on their devices, and potentially pay for advanced, specialized and highly performant on-device models, instead of APIs.
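
Part of the plumbing for this nominally exists. A minimal sketch using ONNX Runtime's execution providers, which is roughly how you'd target a Qualcomm NPU today (this assumes the onnxruntime-qnn package and a QNN-compatible quantized model; provider names and availability vary by vendor, which is much of the problem):

```python
import onnxruntime as ort

# Ask for the NPU first; fall back to CPU if the provider is missing.
# "QNNExecutionProvider" targets Qualcomm NPUs via the onnxruntime-qnn
# package; other vendors ship their own providers (or none at all).
session = ort.InferenceSession(
    "model.onnx",  # hypothetical path to a quantized ONNX model
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which provider you actually got
```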

  • What seems to be delivered by NPUs at this point: filtering background noise from our microphone and blurring our camera using a watt or two less than before.

The idea is that NPUs are more power efficient for convolutional neural network operations. I don't know whether they actually are more power efficient, but it'd be wrong to dismiss them just because they don't unlock new capabilities or perform well on very large models. For smaller ML applications like blurring backgrounds, object detection, or OCR, they could be beneficial for battery life.

  • Yes, the idea before the whole shove-LLMs-into-everything era was that small, dedicated models for different tasks would be integrated into both the OS and applications.

    If you're using a recent phone with a camera, it's likely using ML models that may or may not be using AI accelerators/NPUs on the device itself. The small models are there, though.

    Same thing with translation, subtitles, etc. All small local models doing specialized tasks well.

    • OCR on smartphones is a clear winner in this area. Stepping back, it's just mind blowing how easy it is to take a picture of text and then select it and copy and paste it into whatever. And I totally just take it for granted.

  • Not sure about all NPUs, but TPUs like Google's Coral accelerator are absolutely, massively more efficient per watt than a GPU, at least for things like image processing.

I did some research: if the transistor budget for the NPU were spent on something else in the SoC/CPU, what could you get?

You could have 4-10 additional CPU cores, or 30-100MB more L3 cache. I would definitely rather have more cores or cache than a slightly more efficient background-blurring engine.

NPUs overall need better support from local AI frameworks. They're not "useless" for what they can do (low-precision bulk compute, which is potentially relevant for many of the newer models), and they could help with thermal limits thanks to their higher power efficiency compared to the CPU/iGPU. But all of that requires specialized support that hasn't been forthcoming.

Yeah, that's because the original NPUs were a rush job; the AMD AI Max is the only one that's worth anything, in my opinion.

  • I have a Strix Halo 395 128GB laptop from HP running Ubuntu. I have not been able to do anything with the NPU. I was hoping it could be used for OpenCL, but that does not seem to be the case.

    What examples do you have of making the NPU in this processor useful please?

    • All the videos I've seen of AI workloads on an AMD Strix Halo 128GB setup have used the GPU for the processing. It has a powerful iGPU and unified memory, more like Apple's M chips.

  • Is that because of the actual processing unit or because they doubled the width of the memory bus?

    • It's because it comes with a decent iGPU, not because of the NPU inside of it. The NPU portion is still the standard tiny ~50 TOPS block, and it could be fed by normal RAM bandwidth like on a much cheaper machine.

      On the RAM bandwidth side, it depends whether you want to look at it as "glass half full" or "glass half empty". Glass half full: the GPU has access to a ton of RAM at ~2x-4x the bandwidth of the normal system memory an iGPU would have, so you can load really big models. Glass half empty: that GPU memory bandwidth is still nearly 2x less than even a 5060 dGPU (which doesn't have to share any of that bandwidth with the rest of the system), but you won't fit as large a model on a dGPU and it won't be as power efficient. Rough numbers are sketched below.

      Speaking of power efficiency: it is decently power efficient, but I wouldn't run AI on mine unless I was plugged in anyway, as it still eats through the battery pretty quickly. Great general workstation laptop for the size and wattage, though.
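
      To put ballpark numbers on the comparison above (approximate public figures, treat them as assumptions rather than measurements):

      ```python
      # Ballpark memory bandwidths in GB/s; approximate public figures,
      # treated here as assumptions rather than measurements.
      dual_channel_ddr5 = 100   # what a typical laptop iGPU shares
      strix_halo        = 256   # 256-bit LPDDR5X on the AI Max 395
      rtx_5060_class    = 450   # mid-range dGPU with dedicated GDDR

      print(f"vs. a normal iGPU:    ~{strix_halo / dual_channel_ddr5:.1f}x more")
      print(f"vs. a mid-range dGPU: ~{rtx_5060_class / strix_halo:.1f}x less")
      ```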

If you do use video chat background blurring, the NPU is more efficient at it than your CPU or GPU. So the features it buys you are longer battery life, less resource usage on your main chips, and better performance for the things NPUs can do, e.g. higher video quality on your blurred background.

  • Really, the best we can do with the NPU is a less battery-intensive blurred background? R&D money well spent, I guess...

The stacks for consumer NPUs are absolutely cursed; this does not surprise me.

They (Dell) promised a lot in their marketing, but we're several years into the whole Copilot PC thing and you still can barely, if at all, use sane stacks with laptop NPUs.

NPUs were pushed by Microsoft, who saw the writing on the wall: AI like ChatGPT will dominate the user's experience, edge computing is a huge advantage in that regard, and Apple's hardware can do it. NPUs are basically Microsoft trying to fudge their way to a llamacpp-on-Apple-Silicon experience. Obviously it failed, but they couldn't not try.

  • > NPUs were pushed by Microsoft, who saw the writing on the wall: AI like ChatGPT will dominate the user's experience, edge computing is a huge advantage in that regard

    Then where is a demo application from Microsoft of a model that I can run locally where my user experience is so much better (faster?) if my computer has an NPU?

  • I think the reason NPUs failed is that Microsoft's preferred standard, ONNX, and the runtime they developed are a dud. Exporting models to work with ONNX is a pain in the ass.
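
    Even the happy path only looks simple. A toy export sketch (hypothetical model; the pain starts as soon as a real model uses ops, dynamic shapes, or quantization schemes the exporter or the NPU backend doesn't support):

    ```python
    import torch

    # Toy model: this exports cleanly. Real models hit unsupported ops,
    # dynamic shapes, and quantization mismatches long before the NPU.
    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
    example_input = torch.randn(1, 128)

    torch.onnx.export(model, example_input, "toy.onnx", opset_version=17)
    ```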

  • > AI like ChatGPT will dominate the user's experience

    I hope not. Sure, they're helpful, but I'd rather they sit idle behind the scenes and only get used when a specific need arises, rather than be something like a Holodeck audio interface.

The NPU is essentially the Sony Cell "SPE" coprocessor writ large.

The Cell SPE was extremely fast but had a weird memory architecture and a small amount of local memory, just like the NPU, which makes it more difficult for application programmers to work with.

The Copilot Runtime APIs to utilize the NPU are still experimental and mostly unavailable. I can't believe an entire generation of the Snapdragon X chip came and went without working APIs. Truly incredible.

I'm not too familiar with NPUs, but this sounds a lot like the early days of GPU acceleration, where a lot of the time everything still ended up running on the CPU, since that just works everywhere all the time, rather than having to maintain both a CPU version and an NPU version.

I've got one anecdote: a friend needed Live Captions for a translating job and had to get a Copilot+ PC just for that.

  • What software are they using for that, and how did they know ahead of time that the software would use their NPU?

Question: from the perspective of the actual silicon, are these NPUs just another form of SIMD? If so, that's laughable sleight of hand, and the circuits will be relegated to some mothballed footnote in the same manner as AVX-512, etc.

To be fair, SIMD made a massive difference for early multimedia PCs for things like music playback, gaming, and composited UIs.