This model is fully compatible with anything previously done with Gemma 3. I just passed it to one of my VLM fine-tuning scripts and it started without issues (HF Transformers code). On a single GPU with LoRA, the E4B model takes 18 GB of VRAM at batch size 1, where Gemma 3 4B took 21 GB. Nice one from DeepMind; the Gemma 3 family tops the open-weights VLMs.
Fix: it's the E2B
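For anyone curious what that kind of setup roughly looks like, here is a minimal LoRA sketch with HF Transformers + PEFT. The repo id, the Auto class, and the target module names are my assumptions, not something taken from the comment above:

  import torch
  from transformers import AutoProcessor, AutoModelForImageTextToText
  from peft import LoraConfig, get_peft_model

  model_id = "google/gemma-3n-E2B-it"  # assumed HF repo id
  processor = AutoProcessor.from_pretrained(model_id)
  model = AutoModelForImageTextToText.from_pretrained(
      model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )

  # Attach LoRA adapters to the attention projections only; the base weights
  # stay frozen, which is what keeps single-GPU VRAM usage low at batch size 1.
  lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()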
I tried my "Generate an SVG of a pelican riding a bicycle" prompt against Gemma 3n 7.5GB from Ollama and 15GB for mlx-vlm and got a pleasingly different result for the two quantization sizes: https://simonwillison.net/2025/Jun/26/gemma-3n/
Given how primitive that image is, what's the point of even having an image model at this size?
Is that actually a useful benchmark, or is it just for the laughs? I've never really understood that.
It was supposed to be a joke. But weirdly it turns out there's a correlation between how good a model is and how good it is at my stupid joke benchmark.
I didn't realize quite how strong the correlation was until I put together this talk: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
For me, it shows whether LLMs are generalising from their training data. LLMs understand all of the words in the prompt. They understand the spec for SVG better than any human. They know what a bird is. They know what a bike is. They know how to draw (and given access to computer use could probably ace this test). They can plan and execute on those plans.
Everything here should be trivial for LLMs, but they're quite poor at it because there's almost no "how to draw complex shapes in SVG" type content in their training set.
It's been useful, though given the author's popularity I suspect it's only a matter of time before new LLMs become "more aware" of it.
It's useful because it's SVG, so it's different from other image generation methods.
I think in 5 years we might have some ultra-realistic pelicans and this benchmark will turn out quite interesting.
I still don't understand the difference between Gemma and Gemini for on-device use, since neither needs network access. From https://developer.android.com/ai/gemini-nano :
"Gemini Nano allows you to deliver rich generative AI experiences without needing a network connection or sending data to the cloud." -- replace Gemini with Gemma and the sentence is still valid.
Licensing. You can't use the Gemini Nano weights directly (at least commercially) and must interact with them through Android ML Kit or similar Google-approved runtimes.
You can use Gemma commercially using whatever runtime or framework you can get to run it.
It's not even clear you can license language model weights, though.
I'm not a lawyer, but the analysis I've read made a pretty strong argument that there's no human creativity involved in training, which is an entirely automatic process, and as such the weights cannot be copyrighted in any way (the same way you cannot put a license on a software artifact just because you compiled it yourself; you must have copyright ownership of the source code you're compiling).
According to the Gemma 3n preview blog, Gemma 3n shares the same architecture as the upcoming version of Gemini Nano.
The ‘n’ presumably stands for Nano.
Nano is a proprietary model that ships with Android. Gemma is an open model that can be adapted and used anywhere.
Sources: https://developers.googleblog.com/en/introducing-gemma-3n/
Video in the blog linked in this post.
Gemma is open source and apache 2.0 licensed. If you want to include it with an app you have to package it yourself.
Gemini Nano is an Android API that you don't control at all.
> Gemma is open source and apache 2.0 licensed
Closed source but open weight. Let's not dilute the definition of the term to the advantage of big companies.
> Gemma is open source and apache 2.0 licensed.
Are you sure? On a quick look, it appears to use its own bespoke license, not the Apache 2.0 license. And that license appears to have field of use restrictions, which means it would not be classified as an open source license according to the common definitions (OSI, DFSG, FSF).
I suspect the difference is in the training data. Gemini is much more locked down, and if it tries to repeat something from the training data verbatim you will get a 'recitation error'.
Perplexity.ai gave an easier to understand response than Gemini 2.5 afaict.
Gemini nano is for Android only.
Gemma is available for other platforms and has multiple size options.
So it seems like Gemini Nano might be a very focused Gemma everywhere, to follow the biology metaphor instead of the Italian-name interpretation.
The fact that you need HN and competitors to explain your offering should make Google reflect …
I'm not a fan of this anarchic naming convention that OpenAI has apparently made standard across the industry.
What would you have called it?
Gemma 4? I feel that one was incredibly obvious. Let us please just increase the version numbers.
Anthropic is better about this, but then shifted their ordering with the v4 models. Arguably better, but still quite annoying since everything pre-4 uses a different naming scheme.
I wouldn't have added a random letter and I would have chosen a name that's less easy to conflate with Gemini.
Made some GGUFs if anyone wants to run them!
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
I'm also working on an inference + finetuning Colab demo! I'm very impressed since Gemma 3N has audio, text and vision! https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
Tried the E4B model in Ollama and it's totally broken when interpreting images. The output depends only on the text and is consistent in that way, but otherwise completely wrong.
Works fine with regular Gemma 3 4B, so I'll assume it's something on Ollama's side. Edit: yep, text-only for now[1]; it would be nice if that were a bit more prominent than buried in a ticket...
Don't feel like compiling llama.cpp myself, so I'll have to wait to try your GGUFs there.
[1]: https://github.com/ollama/ollama/issues/10792#issuecomment-3...
Oh I don't think multimodal works yet - it's text only for now!
Literally was typing out "Unsloth, do your thing!!" but you are way ahead of me. You rock <3 <3 <3
Thank you!
:) Thanks!
Thanks! What kind of rig do I need?
Likely nothing crazy. My RTX 2080 is pumping out 45 tok/s.
What is `jinja` in this context?
The chat template is stored as a Jinja template.
https://jinja.palletsprojects.com/en/stable/
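For the curious, a chat template is just a small Jinja program that turns a list of messages into the prompt string the model was trained on. A rough, hand-written illustration below; the turn tags follow Gemma's published format, but the real template ships inside the GGUF/tokenizer metadata, which is what --jinja tells llama.cpp to apply:

  from jinja2 import Template

  chat_template = Template(
      "{% for m in messages %}"
      "<start_of_turn>{{ 'model' if m.role == 'assistant' else m.role }}\n"
      "{{ m.content }}<end_of_turn>\n"
      "{% endfor %}"
      "<start_of_turn>model\n"  # leave the model's turn open for generation
  )
  print(chat_template.render(messages=[{"role": "user", "content": "Hello!"}]))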
I'd genuinely like to know how these small models are useful for anyone. I've done a lot of experimenting, and anything smaller than 27B is basically unusable, except as a toy. All I can say for smaller models is that they sometimes produce good answers, which is not enough for anything except monkeying around.
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
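For reference, that kind of local spam filter can be a couple of dozen lines. A minimal sketch below, assuming an Ollama server on the default port and the gemma3:27b-it-qat tag mentioned above; the prompt and the parsing are obviously simplified:

  import requests

  def is_spam(message: str) -> bool:
      prompt = ("You are a spam filter. Answer with exactly one word, "
                "SPAM or HAM.\n\nMessage:\n" + message)
      r = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": "gemma3:27b-it-qat", "prompt": prompt,
                "stream": False, "options": {"temperature": 0}},
          timeout=120,
      )
      # Temperature 0 keeps the one-word verdict deterministic and easy to parse.
      return r.json()["response"].strip().upper().startswith("SPAM")

  print(is_spam("You WON a prize! Click here to claim it now!!!"))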
Qwen2.5-VL 7B is pretty impressive at turning printed or handwritten maths lecture notes into LaTeX code, and is small enough to run (slowly) on a laptop without enough VRAM. Gemma 3 4B was useless at this though, and got stuck in loops or tried to solve the maths problems instead of just converting the working to LaTeX (but it was much faster as it fit into VRAM).
It sounds like you're trying to use them like ChatGPT, but I think that's not what they're for.
I'm sure these can work fine as ideation devices. I treat this more like basic infra. I would absolutely love a future where most phones have some small LLM built in, kind of like a base layer of infrastructure.
There are use cases where even low accuracy could be useful. I can't predict future products, but here are two that are already in place today:
- On the iPhone keyboard, some sort of tiny language model suggests what it thinks are the most likely follow-up words as you write. You only have to pick a suggested next word when it matches what you were planning to type.
- Speculative decoding is a technique which uses a smaller model to speed up inference for a bigger model (sketched after this list).
I'm sure smart people will invent other future use cases too.
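To make the second point concrete, here is a greedy, conceptual sketch of speculative decoding. draft.next_token and target.next_token are hypothetical helpers returning the argmax next token; a real implementation verifies the whole draft in one batched forward pass of the big model and handles sampling, not just greedy decoding:

  def speculative_decode(target, draft, tokens, n_draft=4, max_new=64):
      # The small "draft" model proposes tokens; the big "target" model only
      # has to verify them, which is where the speedup comes from.
      new = 0
      while new < max_new:
          # 1) Cheaply propose n_draft tokens with the small model.
          proposal, ctx = [], list(tokens)
          for _ in range(n_draft):
              t = draft.next_token(ctx)
              proposal.append(t)
              ctx.append(t)
          # 2) Keep proposed tokens while the big model agrees; on the first
          #    disagreement, take the big model's token and start a new round.
          for t in proposal:
              expected = target.next_token(tokens)
              tokens.append(t if expected == t else expected)
              new += 1
              if expected != t or new >= max_new:
                  break
      return tokens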
The best use case I've found for tiny models (<5B params) is as a reference tool for when I don't have WiFi. I've been using Qwen on my MacBook Air as a replacement for Google while I'm writing code on flights. They work great for asking basic questions about syntax and documentation.
Tiny models (4B or less) are designed to be fine-tuned for narrow tasks; that way they can outperform large commercial models at a tiny fraction of the price. They're also great for code autocomplete.
7B-8B models are great coding assistants if all you need is dumb, fast refactoring that can't quite be done with macros and standard editor functionality but is still primitive, such as "rename all methods having at least one argument of type SomeType by prefixing their names with ST_" (see the sketch below).
12B is the threshold where models such as Mistral Nemo or Gemma 3 12B start writing coherent prose.
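As an illustration of that middle tier, a refactor like the one above could be driven through a local OpenAI-compatible endpoint (Ollama and llama.cpp's llama-server both expose one). The port, model tag, and file name here are assumptions, and in practice you'd review the diff rather than trust the output blindly:

  from openai import OpenAI

  # Ollama serves an OpenAI-compatible API at /v1; the api_key is ignored.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

  source = open("service.py").read()  # hypothetical file to refactor
  resp = client.chat.completions.create(
      model="gemma3n:e4b",  # assumed local model tag
      temperature=0,
      messages=[{
          "role": "user",
          "content": "Rename all methods that have at least one argument of "
                     "type SomeType by prefixing their names with ST_. "
                     "Return only the modified code.\n\n" + source,
      }],
  )
  print(resp.choices[0].message.content)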
Kevin Kwok did a great job taking it apart: https://github.com/antimatter15/reverse-engineering-gemma-3n
The Y-axis in that graph is fucking hilarious
LM Studio has MLX variants of the model out: http://huggingface.co/lmstudio-community/gemma-3n-E4B-it-MLX...
However it's still 8B parameters and there are no quantized models just yet.
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
I read the general parts and skimmed the inner workings but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmark/tasks did it improve upon?
Until it gets into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to the ones that need 10GB+ of RAM, if I understand that right?
> What does this concretely do that Gemma didn't already do
Open weights
Huh? I'm pretty sure I ran Gemma on my phone last month. Or is there a difference between downloadable (you get the weights because it's necessary to run the thing) and "open" weights?
We need a table somewhere on Google that lists the titles of the products and their descriptions, functions, or what they actually do.
What are some use cases for these local small models, for individuals? It seems like for programming-related work the proprietary models are significantly better, and that's all I really use LLMs for personally.
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
For me? Handling data like private voice memos, pictures, videos, calendar information, emails, some code, etc. Stuff I wouldn't want to share on the internet / have a model potentially slurp up and regurgitate as part of its memory when the data is invariably used in some future training process.
I think speech-to-text is the highlight use case for local models, because they are now really good at it and there's no network latency.
How does it compare to Whisper? Does it hallucinate less, or is it more capable?
I just like having quick access to reasonable model that runs comfortably on my phone, even if I'm in a place without connectivity.
I’m thinking about building a pipeline to mass generate descriptions for the images in my photo collection, to facilitate search. Object recognition in local models is already pretty good, and perhaps I can pair it with models to recognize specific people by name as well.
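A pipeline like that can stay very simple. Below is a minimal sketch using the ollama Python client against a local vision-capable model; the model tag, folder path, and one-sentence prompt are all placeholders of mine, and the output is just a flat JSON file you can text-search:

  import json
  from pathlib import Path

  import ollama  # pip install ollama; assumes a local Ollama server is running

  MODEL = "gemma3:4b"  # any local vision-capable model tag
  index = {}

  for photo in Path("~/Pictures").expanduser().rglob("*.jpg"):
      resp = ollama.chat(
          model=MODEL,
          messages=[{
              "role": "user",
              "content": "Describe this photo in one sentence, listing the "
                         "main objects and any visible text.",
              "images": [str(photo)],
          }],
      )
      index[str(photo)] = resp["message"]["content"]

  # A flat caption index is enough to make the collection text-searchable.
  Path("photo_index.json").write_text(json.dumps(index, indent=2))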
Hoping to try it out with home assistant.
filtering out spam SMS messages without sending all SMS to the cloud
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
I have been using Ollama + Open WebUI. Open WebUI already has a web search tool; all you would need to do is click the toggle for it under the chat.
Unfortunately the OWUI web search is really slow and just not great overall. I would suggest using an MCP integration instead.
Google does have an API for this. It has limits, but it's perfectly good for personal use.
https://developers.google.com/custom-search/v1/overview
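For reference, calling that Custom Search JSON API is a single GET request; the key and engine id below are placeholders, and the returned snippets are what you'd feed to the local model as grounding context:

  import requests

  API_KEY = "YOUR_API_KEY"      # from the Google Cloud console
  ENGINE_ID = "YOUR_ENGINE_ID"  # the "cx" id of a Programmable Search Engine

  def web_search(query: str, n: int = 5) -> list[dict]:
      r = requests.get(
          "https://www.googleapis.com/customsearch/v1",
          params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": n},
          timeout=30,
      )
      r.raise_for_status()
      # Each result item carries a title, link, and snippet.
      return [{"title": i["title"], "link": i["link"],
               "snippet": i.get("snippet", "")}
              for i in r.json().get("items", [])]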
Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.
If I wanted to run this locally at somewhat decent speeds, is an RK3588S board (like OrangePi 5) the cheapest option?
Tried it on an S25+ (Snapdragon 8 Elite): 0.82 tok/s with the 4B model. It's barely usable speed, but it's pretty impressive all the same.
It depends on your idea of decent speeds and what you would use it for. I just tried it on a laptop with an AMD HX 370 running on battery in power save mode and it's not especially impressive, although it runs much better in balanced or performance mode. I gave it the prompt "write a fizzbuzz program in rust" and it took almost a minute and a half. I expect it to be pretty terrible on an SBC. Your best bet is to try it out on the oldest hardware you have and figure out if you can tolerate worse performance.
good idea, will test that out
I'm going to attempt to get it running on the BeagleY-AI https://www.beagleboard.org/boards/beagley-ai
Similar form factor to raspberry pi but with 4 TOPS of performance and enough RAM.
RK3588 uses a 7 year old CPU design and OrangePi 5 looks expensive (well over $100).
A used sub-$100 x86 box is going to be much better
you're right. For my purposes, I was thinking of something I could use if I wanted to manufacture a new (smallish) product
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact I'm thinking of trying to install it on a VPS so I can have an alternative to pricy APIs.
Updated Ollama to use this; now neither the old nor the new model works. Much productivity.
Well, see it the other way, there is something positive: commenters here on HN claim that AI is useless, so you can now join the bandwagon of people who have free time.
It seems way worse than other small models, including responding with complete non sequiturs. I think my favorite small model is still DeepSeek distilled with Llama 8B.
The key here is multimodal.
Anyone have any idea on the viability of running this on a Pi5 16GB? I have a few fun ideas if this can handle working with images (or even video?) well.
The 4-bit quant weighs 4.25 GB, and then you need space for the rest of the inference process. So yeah, you can definitely run the model on a Pi; you may just have to wait some time for results.
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF
See here, long story short, this is another in a series of blog posts that would lead you to believe this was viable, but it isn't :/ https://news.ycombinator.com/item?id=44389793
I just tried gemma3 out and it seems to be prone to getting stuck in loops where it outputs an infinite stream of the same word.
Sounds a lot like an autoregressive sampling problem. Maybe try to set temperature and repeat penalty differently.
You're right, I should have checked the model settings. For some reason the default model profile in Ollama had temperature set to 0. Changing the temperature and repeat penalty worked much better than it did when I tried to correct similar behavior in the smallest phi4 reasoning model.
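For anyone hitting the same loops, this is roughly what those settings look like when set explicitly through the Ollama Python client; the model tag and the exact values are just illustrative:

  import ollama

  resp = ollama.chat(
      model="gemma3n:e4b",  # assumed local model tag
      messages=[{"role": "user", "content": "Write a haiku about telescopes."}],
      options={
          "temperature": 0.7,     # a temperature of 0 makes greedy loops much more likely
          "repeat_penalty": 1.1,  # mildly discourage repeating recent tokens
      },
  )
  print(resp["message"]["content"])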
Is there a chance that we see an uncensored version of this ?
Can you apply abliteration? I'm not sure if their MatFormer architecture is compatible with current techniques.
Any readily-available APKs for testing this on Android?
APK link here: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
Ah, I already had the Edge Gallery app installed and it had Gemma 3n E4B downloaded... is this the same model that was previously released?
Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
You can download an APK from the official Google project for this, linked from the blogpost: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
If I download it and run it on a Pixel Fold, with the actual 2B model (half the size of the ones the 60 fps claim is made for), it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)
So, *0.16* frames a second, not 60 fps.
The blog post is jammed up with so many claims about how special this is for on-device use and performance that just... seemingly aren't true. At all.
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something big, or
B) they are lying, repeatedly, big time, in a way that would be shown near-immediately when you actually tried building on it because it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
It looks to me, from the marketing copy, like it's the vision encoder that can run at 60 FPS.
> MobileNet-V5-300M
Which makes sense, as it's 300M parameters and probably far less complex; it's not a multi-billion-parameter transformer.
I agree that's the most likely interpretation - does it read as a shell game to you? Like, it can do that, but once you get the thing that can use the output involved it's 1/100th of that? Do they have anything that does stuff with the outputs from just MobileNet? If they don't, how are they sure I can build the 60 fps realtime audiovisual experiences they say I can?
The APK that you linked runs the inference on CPU and does not run it on the Google Tensor.
That sounds fair, but opens up another N questions:
- Are there APK(s) that run on Tensor?
- Is it possible to run on Tensor if you're not Google?
- Is there anything at all from anyone I can download that'll run it on Tensor?
- If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
How does their demo work then? It's been 3 months since 3n was first released publicly.
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
Sure it can. The easiest way is to get Ollama, then `ollama run gemma3n`. You can pair it with tools like simonw's LLM to pipe stuff to it.
This should run fine on most hardware - CPU inference of the E2B model on my Pixel 8 Pro gives me ~9tok/second of decode speed.
Can popular sci-fi go 30 seconds without some lame wad naming themselves or a product after it?
I made a simple website[0] to check online model MMLU quickly (runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on LMArena Leaderboard is just some form of gaming the metric.
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it is a better talker.
0. https://mmlu.borgcloud.ai/
1. https://trashtalk.borg.games/
> for everything from safeguarding
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
Imagine the entire internet is just an on-the-fly UI; that would be pretty cool.
My post politely describing how this blog post does not match Google's own app, running inference on a Pixel, is downvoted to -1, below dead posts with one-off short jokes.
I am posting again because I've been here 16 years now, it is very suspicious that happened, and given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge on this, who notes this won't work as advertised unless you're Google (i.e. internally, they have it binding to a privileged system process that can access the Tensor core, and this isn't available to third parties. Anyone else is getting 1/100th of the speeds in the post)
This post promises $150K in prizes for on-device multimodal apps and tells you it's running at up to 60 fps. They know it runs at 0.1 fps, engineering says that's because they haven't prioritized third parties yet, and somehow Google is getting away with this.
[flagged]
This is completely offtopic, but in case your question is genuine:
https://www.youtube.com/watch?v=F2X1pKEHIYw
> Why Some People Say SHTRONG (the CHRUTH), by Dr Geoff Lindsey