Comment by avaer
11 hours ago
It works. I've shipped this as "local inference"/a poor person's ollama for low-end LLM tasks like search. The main win is that it's free and privacy-preserving, and (mostly) transparent to users in that they don't have to do anything, which is great for giving non-technical users local inference without making them do scary native things.
But keep in mind the actual experience for users is not great; the model download is orders of magnitude greater than downloading the browser itself, and something that needs to happen before you get your first token back. That's unfixable until operating systems start reliably shipping their own prebaked models that an API like this could plug into.
Is it actually privacy preserving? Chrome mostly exists to extract as much information from a user as it can without triggering a lawsuit whose penalty exceeds what is gained through ads, military contracts, etc. Android isn't too far off either. I would welcome any alternative to this. I can see applications for this being things like "while the device is idle and charging, summarize all of the user's recent text communications", or whatever else, as a legal loophole for wiretap laws.
>I can see applications for this being things like "while the device is idle and charging, summarize all of the user's recent text communications", or whatever else, as a legal loophole for wiretap laws
This just exposes an API for sites to use. If they wanted to do the types of spying you're cynically suggesting, they could just add it without an API and you'd be none the wiser. Chrome contains closed source components so you wouldn't even know.
Do you think no-one would notice that the Chrome download was 20GB larger?
It's a lot easier to hide the language they need in a EULA for a feature like this than it would be elsewhere.
I appreciate that you feel this is a cynical take. But have you seen the class action lawsuits against Google over the last 5 years? As far as I can remember they exceed a billion dollars, and they are for more blatant things than this.
> That's unfixable until operating systems start reliably shipping their own prebaked models that an API like this could plug into.
Maybe the next big thing will be premium software subscriptions that bundle a stack of 5090s as an extra.
> But keep in mind the actual experience for users is not great; the model download is orders of magnitude greater than downloading the browser itself, and something that needs to happen before you get your first token back.
With MoE models, you could fetch expert layers from the network on demand by issuing HTTP range queries for the corresponding offsets, similar to how BitTorrent downloads file chunks from multiple hosts. You'd still have to download the shared layers, but time to first token would now be proportional to the active parameter count rather than the total. Of course this wouldn't be totally "offline" inference anymore, but for a web browser feature that's not a key consideration.
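As a sketch of the range-query idea: if the host publishes a byte-offset manifest for the weight file, the client can translate a layer index into an HTTP `Range` header and fetch only those bytes. The layer sizes and file layout below are entirely hypothetical; a real deployment would ship the manifest alongside the weights.

```javascript
// Hypothetical layout: per-layer byte sizes, in file order.
const layerSizes = [1_048_576, 4_194_304, 4_194_304, 4_194_304];

// Cumulative offsets: layer i occupies bytes [offsets[i], offsets[i + 1]).
const offsets = layerSizes.reduce(
  (acc, size) => (acc.push(acc[acc.length - 1] + size), acc),
  [0]
);

// Translate a layer index into an HTTP Range header value.
function rangeHeaderFor(layer) {
  const start = offsets[layer];
  const end = offsets[layer + 1] - 1; // Range is inclusive on both ends
  return `bytes=${start}-${end}`;
}

// Fetch just the bytes for one layer (not invoked here; needs a real URL
// and a server that honors Range requests with 206 Partial Content).
async function fetchLayer(url, layer) {
  const res = await fetch(url, { headers: { Range: rangeHeaderFor(layer) } });
  if (res.status !== 206) throw new Error("server ignored the Range header");
  return new Uint8Array(await res.arrayBuffer());
}
```

The same mechanism generalizes to multiple mirrors, which is where the BitTorrent comparison comes in: each range can be pulled from whichever host answers fastest.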
> With MoE models, you could fetch expert layers from the network on demand
This is a common misconception, probably due to the unfortunate naming. Expert layers are not "expert" at any particular subject, and the active parameter count only refers to the layers activated per token. You'd still need all (or most of) the layers for any particular query, even if some layers have a very low chance of being activated.
All in all, you'd be better off lazily loading the entire model; at least you'd then know you have the capability to run inference from that point on.
Ultimately it would amount to lazy-loading the model, but the parameters themselves would be fetched from the network as needed, which still decreases time-to-first-token. It's true that "expert" choices will span most of the model, regardless of any particular "subject" or "topic" choice, but if we simply care about time-to-first-token it's still a viable strategy.
> operating systems start reliably shipping their own prebaked models
Here's hoping that dystopia never happens.
Would it be less dystopian for operating systems to ship with their own browser that ships with its own models? Or do you find the current situation, where operating systems ship with browsers, dystopian?
> It works, I've shipped this as a "local inference"/poor person's ollama for low-end llm tasks like search
fantastic!
> the model download is orders of magnitude greater than downloading the browser itself, and something that needs to happen before you get your first token back
sure, but does this mean the model is lazily downloaded? that is, if I used this and mine was the first call to the model, would the user be stuck waiting until the model finished downloading?
that sounds like a horrible user experience - maybe Chrome reduces the confusion by showing a download dialog, a status indicator, or similar?
also, any idea what the on-disk impact is?
The model download is lazy and cached, so it's presumably a one-time cost across all origins (I assume so, since the alternative would be a trivial DoS waiting to happen).
So it's once per browser, not once per site.
You can track the download state yourself and pop whatever UI you want.
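For tracking the download state, Chrome's experimental Prompt API exposes progress through a `monitor` callback passed to `LanguageModel.create()`. A minimal sketch follows; the API names match Chrome's built-in AI documentation at the time of writing, but the feature is experimental and subject to change, so treat this as illustrative:

```javascript
// Format a download fraction as a percentage string.
// Per the current docs, downloadprogress reports `loaded` as a value in [0, 1].
function formatProgress(loaded) {
  return `${Math.round(loaded * 100)}%`;
}

async function createWithProgressUI() {
  // Guarded so this is a no-op outside Chrome, where LanguageModel is undefined.
  if (typeof LanguageModel === "undefined") return null;

  const availability = await LanguageModel.availability();
  if (availability === "unavailable") return null;

  // create() kicks off the one-time model download if it isn't cached yet;
  // the monitor lets you drive whatever progress UI you want.
  return LanguageModel.create({
    monitor(m) {
      m.addEventListener("downloadprogress", (e) => {
        console.log(`model download: ${formatProgress(e.loaded)}`);
      });
    },
  });
}
```

In a real page you'd feed the progress events into a visible indicator rather than `console.log`, since the first-run download can take a long while on slow connections.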
chrome://on-device-internals reports "Model Name: v3Nano Version: 2025.06.30.1229 Folder size: 4,072.13 MiB" on a random Windows machine I just checked.
Thank you, stranger! I would have assumed the size would vary based on whether your hardware supports the high-quality GPU backend (4 GB) or defaults to a smaller CPU-compatible version (3 GB), but the 22 GB note on that page is really confusing. Even if it included the model server, where's the remaining 18 GB going?
> Storage: At least 22 GB of free space on the volume that contains your Chrome profile.
Yes, but that is then followed by:
Lmao and here I am still staunchly treating Blazor’s 2MB runtime as a deal-breaker
> `> Storage: At least 22 GB of free space on the volume that contains your Chrome profile.`
Yes, I can read and comprehend English, and you should assume I read the page. Because of the "At least" wording, I was curious what someone who has actually used the feature has observed, i.e., learning from people who have already done it.
Doesn't sound great, but consider how much better this is than every webpage trying to load their own models.
If it turns out useful enough I'm sure browsers will just start including it as (perhaps optional?) part of installation.