This model is fully compatible with anything previously done with Gemma 3. I just passed it to one of my VLM fine-tuning scripts and it started without issues (HF Transformers code). On a single GPU with LoRA, the E4B model takes 18 GB of VRAM at batch size 1, where Gemma 3 4B took 21 GB. Nice one from DeepMind; the Gemma 3 family tops the open-weights VLMs.
Fix: it's the E2B
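For anyone curious what that kind of setup roughly looks like, here is a minimal LoRA sketch with HF Transformers + PEFT. The repo id, the Auto class, and the target module names are my assumptions, not something taken from the comment above:

  import torch
  from transformers import AutoProcessor, AutoModelForImageTextToText
  from peft import LoraConfig, get_peft_model

  model_id = "google/gemma-3n-E2B-it"  # assumed HF repo id
  processor = AutoProcessor.from_pretrained(model_id)
  model = AutoModelForImageTextToText.from_pretrained(
      model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )

  # Attach LoRA adapters to the attention projections only; the base weights
  # stay frozen, which is what keeps single-GPU VRAM usage low at batch size 1.
  lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()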
I tried my "Generate an SVG of a pelican riding a bicycle" prompt against Gemma 3n 7.5GB from Ollama and 15GB for mlx-vlm and got a pleasingly different result for the two quantization sizes: https://simonwillison.net/2025/Jun/26/gemma-3n/
Given how primitive that image is, what's the point of even having an image model at this size?
Is that actually a useful benchmark, or is it just for the laughs? I've never really understood that.
It was supposed to be a joke. But weirdly it turns out there's a correlation between how good a model is and how good it is at my stupid joke benchmark.
I didn't realize quite how strong the correlation was until I put together this talk: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
For me, it shows whether LLMs are generalising from their training data. LLMs understand all of the words in the prompt. They understand the spec for SVG better than any human. They know what a bird is. They know what a bike is. They know how to draw (and given access to computer use could probably ace this test). They can plan and execute on those plans.
Everything here should be trivial for LLMs, but they're quite poor at it because there's almost no "how to draw complex shapes in SVG" type content in their training set.
It's been useful, though given the author's popularity I suspect it's only a matter of time before new LLMs become "more aware" of it.
It's useful because it's SVG, so it's different from other image generation methods.
I think in 5 years we might have some ultra-realistic pelicans and this benchmark will turn out quite interesting.
I still don't understand the difference between Gemma and Gemini for on-device use, since neither needs network access. From https://developer.android.com/ai/gemini-nano :
"Gemini Nano allows you to deliver rich generative AI experiences without needing a network connection or sending data to the cloud." -- replace Gemini with Gemma and the sentence is still valid.
Licensing. You can't use the Gemini Nano weights directly (at least commercially) and must interact with them through Android ML Kit or similar Google-approved runtimes.
You can use Gemma commercially using whatever runtime or framework you can get to run it.
It's not even clear you can license language model weights, though.
I'm not a lawyer, but the analysis I've read made a pretty strong argument that there's no human creativity involved in training, which is an entirely automatic process, and as such the weights cannot be copyrighted in any way (the same way you cannot put a license on a software artifact just because you compiled it yourself; you must have copyright ownership of the source code you're compiling).
According to the Gemma 3n preview blog, Gemma 3n shares the same architecture as the upcoming version of Gemini Nano.
The ‘n’ presumably stands for Nano.
Nano is a proprietary model that ships with Android. Gemma is an open model that can be adapted and used anywhere.
Sources: https://developers.googleblog.com/en/introducing-gemma-3n/
Video in the blog linked in this post.
Gemma is open source and apache 2.0 licensed. If you want to include it with an app you have to package it yourself.
Gemini Nano is an Android API that you don't control at all.
> Gemma is open source and apache 2.0 licensed
Closed source but open weight. Let's not dilute the definition of the term to the advantage of big companies.
> Gemma is open source and apache 2.0 licensed.
Are you sure? On a quick look, it appears to use its own bespoke license, not the Apache 2.0 license. And that license appears to have field of use restrictions, which means it would not be classified as an open source license according to the common definitions (OSI, DFSG, FSF).
I suspect the difference is in the training data. Gemini is much more locked down, and if it tries to repeat something from the training data verbatim you will get a 'recitation error'.
Perplexity.ai gave an easier to understand response than Gemini 2.5 afaict.
Gemini nano is for Android only.
Gemma is available for other platforms and has multiple size options.
So it seems like Gemini Nano might be a very focused Gemma everywhere, to follow the biology metaphor instead of the Italian-name interpretation.
The fact that you need HN and competitors to explain your offering should make Google reflect …
I'm not a fan of this anarchic naming convention that OpenAI has apparently made standard across the industry.
What would you have called it?
Gemma 4? I feel that one was incredibly obvious. Let us please just increase the version numbers.
Anthropic is better about this, but then shifted their ordering with the v4 models. Arguably better, but still quite annoying since everything pre-4 uses a different naming scheme.
I wouldn't have added a random letter and I would have chosen a name that's less easy to conflate with Gemini.
Made some GGUFs if anyone wants to run them!
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
I'm also working on an inference + finetuning Colab demo! I'm very impressed since Gemma 3N has audio, text and vision! https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
Tried the E4B model in Ollama and it's totally broken when interpreting images. The output depends only on the text and is consistent in that way, but otherwise completely wrong.
Works fine with regular Gemma 3 4B, so I'll assume it's something on Ollama's side. Edit: yep, text-only for now[1]; it would be nice if that were a bit more prominent than buried in a ticket...
Don't feel like compiling llama.cpp myself, so I'll have to wait to try your GGUFs there.
[1]: https://github.com/ollama/ollama/issues/10792#issuecomment-3...
Oh I don't think multimodal works yet - it's text only for now!
Literally was typing out "Unsloth, do your thing!!" but you are way ahead of me. You rock <3 <3 <3
Thank you!
:) Thanks!
Thanks! What kind of rig do I need?
Likely nothing crazy. My RTX 2080 is pumping out 45 tok/s.
What is `jinja` in this context?
The chat template is stored as a Jinja template.
https://jinja.palletsprojects.com/en/stable/
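For the curious, a chat template is just a small Jinja program that turns a list of messages into the prompt string the model was trained on. A rough, hand-written illustration below; the turn tags follow Gemma's published format, but the real template ships inside the GGUF/tokenizer metadata, which is what --jinja tells llama.cpp to apply:

  from jinja2 import Template

  chat_template = Template(
      "{% for m in messages %}"
      "<start_of_turn>{{ 'model' if m.role == 'assistant' else m.role }}\n"
      "{{ m.content }}<end_of_turn>\n"
      "{% endfor %}"
      "<start_of_turn>model\n"  # leave the model's turn open for generation
  )
  print(chat_template.render(messages=[{"role": "user", "content": "Hello!"}]))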
I'd genuinely like to know how these small models are useful for anyone. I've done a lot of experimenting, and anything smaller than 27B is basically unusable, except as a toy. All I can say for smaller models is that they sometimes produce good answers, which is not enough for anything except monkeying around.
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
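For reference, that kind of local spam filter can be a couple of dozen lines. A minimal sketch below, assuming an Ollama server on the default port and the gemma3:27b-it-qat tag mentioned above; the prompt and the parsing are obviously simplified:

  import requests

  def is_spam(message: str) -> bool:
      prompt = ("You are a spam filter. Answer with exactly one word, "
                "SPAM or HAM.\n\nMessage:\n" + message)
      r = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": "gemma3:27b-it-qat", "prompt": prompt,
                "stream": False, "options": {"temperature": 0}},
          timeout=120,
      )
      # Temperature 0 keeps the one-word verdict deterministic and easy to parse.
      return r.json()["response"].strip().upper().startswith("SPAM")

  print(is_spam("You WON a prize! Click here to claim it now!!!"))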
Qwen2.5-VL 7B is pretty impressive at turning printed or handwritten maths lecture notes into LaTeX code, and is small enough to run (slowly) on a laptop without enough VRAM. Gemma 3 4B was useless at this though, and got stuck in loops or tried to solve the maths problems instead of just converting the working to LaTeX (but it was much faster as it fit into VRAM).
It sounds like you're trying to use them like ChatGPT, but I think that's not what they're for.
I'm sure these can work fine as ideation devices. I treat this more like basic infra. I would absolutely love a future where most phones have some small LLM built in, kind of like a base layer of infrastructure.
There are use cases where even low accuracy could be useful. I can't predict future products, but here are two that are already in place today:
- On the iPhone keyboard, some sort of tiny language model suggests what it thinks are the most likely follow-up words as you write. You only have to pick a suggested next word when it matches what you were planning to type.
- Speculative decoding is a technique which uses a smaller model to speed up inference for a bigger model (sketched after this list).
I'm sure smart people will invent other future use cases too.
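To make the second point concrete, here is a greedy, conceptual sketch of speculative decoding. draft.next_token and target.next_token are hypothetical helpers returning the argmax next token; a real implementation verifies the whole draft in one batched forward pass of the big model and handles sampling, not just greedy decoding:

  def speculative_decode(target, draft, tokens, n_draft=4, max_new=64):
      # The small "draft" model proposes tokens; the big "target" model only
      # has to verify them, which is where the speedup comes from.
      new = 0
      while new < max_new:
          # 1) Cheaply propose n_draft tokens with the small model.
          proposal, ctx = [], list(tokens)
          for _ in range(n_draft):
              t = draft.next_token(ctx)
              proposal.append(t)
              ctx.append(t)
          # 2) Keep proposed tokens while the big model agrees; on the first
          #    disagreement, take the big model's token and start a new round.
          for t in proposal:
              expected = target.next_token(tokens)
              tokens.append(t if expected == t else expected)
              new += 1
              if expected != t or new >= max_new:
                  break
      return tokens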
The best use case I've found for tiny models (<5B params) is as a reference tool for when I don't have WiFi. I've been using Qwen on my MacBook Air as a replacement for Google while I'm writing code on flights. They work great for asking basic questions about syntax and documentation.
Tiny models (4B or less) are designed to be fine-tuned for narrow tasks; that way they can outperform large commercial models at a tiny fraction of the price. They're also great for code autocomplete.
7B-8B models are great coding assistants if all you need is dumb, fast refactoring that can't quite be done with macros and standard editor functionality but is still primitive, such as "rename all methods having at least one argument of type SomeType by prefixing their names with ST_" (see the sketch below).
12B is the threshold where models such as Mistral Nemo or Gemma 3 12B start writing coherent prose.
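As an illustration of that middle tier, a refactor like the one above could be driven through a local OpenAI-compatible endpoint (Ollama and llama.cpp's llama-server both expose one). The port, model tag, and file name here are assumptions, and in practice you'd review the diff rather than trust the output blindly:

  from openai import OpenAI

  # Ollama serves an OpenAI-compatible API at /v1; the api_key is ignored.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

  source = open("service.py").read()  # hypothetical file to refactor
  resp = client.chat.completions.create(
      model="gemma3n:e4b",  # assumed local model tag
      temperature=0,
      messages=[{
          "role": "user",
          "content": "Rename all methods that have at least one argument of "
                     "type SomeType by prefixing their names with ST_. "
                     "Return only the modified code.\n\n" + source,
      }],
  )
  print(resp.choices[0].message.content)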
Kevin Kwok did a great job taking it apart: https://github.com/antimatter15/reverse-engineering-gemma-3n
The Y-axis in that graph is fucking hilarious
LM Studio has MLX variants of the model out: http://huggingface.co/lmstudio-community/gemma-3n-E4B-it-MLX...
However it's still 8B parameters and there are no quantized models just yet.
Anyone know how much it costs to use the deployed version of Gemma 3n? The docs indicate you can use the Gemini API for deployed Gemma 3n, but the pricing page just shows "unavailable".
I read the general parts and skimmed the inner workings but I can't figure out what the high-level news is. What does this concretely do that Gemma didn't already do, or what benchmark/tasks did it improve upon?
Until it gets into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than other models up to the ones that need 10GB+ of RAM, if I understand that right?
> What does this concretely do that Gemma didn't already do
Open weights
Huh? I'm pretty sure I ran Gemma on my phone last month. Or is there a difference between downloadable (you get the weights because it's necessary to run the thing) and "open" weights?
We need a table somewhere on Google that lists the titles of the products and their descriptions, functions, or what they actually do.
What are some use cases for these local small models, for individuals? It seems like for programming-related work the proprietary models are significantly better, and that's all I really use LLMs for personally.
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
For me? Handling data like private voice memos, pictures, videos, calendar information, emails, some code, etc. Stuff I wouldn't want to share on the internet / have a model potentially slurp up and regurgitate as part of its memory when the data is invariably used in some future training process.
I think speech-to-text is the highlight use case for local models, because they are now really good at it and there's no network latency.
How does it compare to Whisper? Does it hallucinate less, or is it more capable?
I just like having quick access to reasonable model that runs comfortably on my phone, even if I'm in a place without connectivity.
I’m thinking about building a pipeline to mass generate descriptions for the images in my photo collection, to facilitate search. Object recognition in local models is already pretty good, and perhaps I can pair it with models to recognize specific people by name as well.
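A pipeline like that can stay very simple. Below is a minimal sketch using the ollama Python client against a local vision-capable model; the model tag, folder path, and one-sentence prompt are all placeholders of mine, and the output is just a flat JSON file you can text-search:

  import json
  from pathlib import Path

  import ollama  # pip install ollama; assumes a local Ollama server is running

  MODEL = "gemma3:4b"  # any local vision-capable model tag
  index = {}

  for photo in Path("~/Pictures").expanduser().rglob("*.jpg"):
      resp = ollama.chat(
          model=MODEL,
          messages=[{
              "role": "user",
              "content": "Describe this photo in one sentence, listing the "
                         "main objects and any visible text.",
              "images": [str(photo)],
          }],
      )
      index[str(photo)] = resp["message"]["content"]

  # A flat caption index is enough to make the collection text-searchable.
  Path("photo_index.json").write_text(json.dumps(index, indent=2))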
Hoping to try it out with home assistant.
filtering out spam SMS messages without sending all SMS to the cloud
Suppose I'd like to use models like this one to perform web searches. Is there anything available in the open-source world that would let me do that without much tinkering needed?
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
I have been using Ollama + Open WebUI. Open WebUI already has a web search tool; all you would need to do is click the toggle for it under the chat.
Unfortunately the OWUI web search is really slow and just not great overall. I would suggest using an MCP integration instead.
Google does have an API for this. It has limits, but it's perfectly good for personal use.
https://developers.google.com/custom-search/v1/overview
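For reference, calling that Custom Search JSON API is a single GET request; the key and engine id below are placeholders, and the returned snippets are what you'd feed to the local model as grounding context:

  import requests

  API_KEY = "YOUR_API_KEY"      # from the Google Cloud console
  ENGINE_ID = "YOUR_ENGINE_ID"  # the "cx" id of a Programmable Search Engine

  def web_search(query: str, n: int = 5) -> list[dict]:
      r = requests.get(
          "https://www.googleapis.com/customsearch/v1",
          params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": n},
          timeout=30,
      )
      r.raise_for_status()
      # Each result item carries a title, link, and snippet.
      return [{"title": i["title"], "link": i["link"],
               "snippet": i.get("snippet", "")}
              for i in r.json().get("items", [])]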
Unfortunately 100 queries per day is quite low for LLMs, which tend to average 5-10 searches per prompt in my experience. And paying for the search API doesn’t seem to be worth it compared to something like a ChatGPT subscription.
If I wanted to run this locally at somewhat decent speeds, is an RK3588S board (like OrangePi 5) the cheapest option?
Tried it on an S25+ (Snapdragon 8 Elite): 0.82 tok/s with the 4B model. It's barely usable speed, but it's pretty impressive all the same.
It depends on your idea of decent speeds and what you would use it for. I just tried it on a laptop with an AMD HX 370 running on battery in power save mode and it's not especially impressive, although it runs much better in balanced or performance mode. I gave it the prompt "write a fizzbuzz program in rust" and it took almost a minute and a half. I expect it to be pretty terrible on an SBC. Your best bet is to try it out on the oldest hardware you have and figure out if you can tolerate worse performance.
good idea, will test that out
I'm going to attempt to get it running on the BeagleY-AI https://www.beagleboard.org/boards/beagley-ai
Similar form factor to raspberry pi but with 4 TOPS of performance and enough RAM.
RK3588 uses a 7 year old CPU design and OrangePi 5 looks expensive (well over $100).
A used sub-$100 x86 box is going to be much better
you're right. For my purposes, I was thinking of something I could use if I wanted to manufacture a new (smallish) product
I've been playing around with E4B in AI Studio and it has been giving me really great results, much better than what you'd expect from an 8B model. In fact I'm thinking of trying to install it on a VPS so I can have an alternative to pricy APIs.
Updated Ollama to use this; now neither the old nor the new model works. Much productivity.
Well, see it the other way, there is something positive: commenters here on HN claim that AI is useless, so you can now join the bandwagon of people who have free time.
It seems way worse than other small models, including responding with complete non sequiturs. I think my favorite small model is still DeepSeek distilled with Llama 8B.
The key here is multimodal.
Anyone have any idea on the viability of running this on a Pi5 16GB? I have a few fun ideas if this can handle working with images (or even video?) well.
The 4-bit quant weighs 4.25 GB, and then you need space for the rest of the inference process. So yeah, you can definitely run the model on a Pi; you may just have to wait some time for results.
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF
See here, long story short, this is another in a series of blog posts that would lead you to believe this was viable, but it isn't :/ https://news.ycombinator.com/item?id=44389793
I just tried gemma3 out and it seems to be prone to getting stuck in loops where it outputs an infinite stream of the same word.
Sounds a lot like an autoregressive sampling problem. Maybe try to set temperature and repeat penalty differently.
You're right, I should have checked the model settings. For some reason the default model profile in Ollama had temperature set to 0. Changing the temperature and repeat penalty worked much better than it did when I tried to correct similar behavior in the smallest phi4 reasoning model.
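For anyone hitting the same loops, this is roughly what those settings look like when set explicitly through the Ollama Python client; the model tag and the exact values are just illustrative:

  import ollama

  resp = ollama.chat(
      model="gemma3n:e4b",  # assumed local model tag
      messages=[{"role": "user", "content": "Write a haiku about telescopes."}],
      options={
          "temperature": 0.7,     # a temperature of 0 makes greedy loops much more likely
          "repeat_penalty": 1.1,  # mildly discourage repeating recent tokens
      },
  )
  print(resp["message"]["content"])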
Is there a chance that we see an uncensored version of this ?
Can you apply abliteration? I'm not sure if their MatFormer architecture is compatible with current techniques.
Any readily-available APKs for testing this on Android?
APK link here: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
Ah, I already had the Edge Gallery app installed and it had Gemma 3n E4B downloaded... is this the same model that was previously released?
Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
You can download an APK from the official Google project for this, linked from the blogpost: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
If I download it and run it on a Pixel Fold, with the actual 2B model (half the size of the ones the 60 fps claim is made for), it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)
So, *0.16* frames a second, not 60 fps.
The blog post is jammed up with so many claims about how special this is for on-device use and performance that just... seemingly aren't true. At all.
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something big, or
B) they are lying, repeatedly, big time, in a way that would be shown near-immediately when you actually tried building on it because it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
It looks to me, from the marketing copy, like it's the vision encoder that can run at 60 FPS.
> MobileNet-V5-300M
Which makes sense, as it's 300M parameters and probably far less complex; it's not a multi-billion-parameter transformer.
I agree that's the most likely interpretation - does it read as a shell game to you? Like, it can do that, but once you get the thing that can use the output involved it's 1/100th of that? Do they have anything that does stuff with the outputs from just MobileNet? If they don't, how are they sure I can build the 60 fps realtime audiovisual experiences they say I can?
The APK that you linked runs the inference on CPU and does not run it on the Google Tensor.
That sounds fair, but opens up another N questions:
- Are there APK(s) that run on Tensor?
- Is it possible to run on Tensor if you're not Google?
- Is there anything at all from anyone I can download that'll run it on Tensor?
- If there isn't, why not? (i.e. this isn't the first on device model release by any stretch, so I can't give benefit of the doubt at this point)
How does their demo work then? It's been 3 months since 3n was first released publicly.
This looks amazing given the parameter sizes and capabilities (audio, visual, text). I like the idea of keeping simple tasks local. I’ll be curious to see if this can be run on an M1 machine…
Sure it can. The easiest way is to get Ollama, then `ollama run gemma3n`. You can pair it with tools like simonw's LLM to pipe stuff to it.
This should run fine on most hardware - CPU inference of the E2B model on my Pixel 8 Pro gives me ~9tok/second of decode speed.
Can popular sci-fi go 30 seconds without some lame wad naming themselves or a product after it?
I made a simple website[0] to check online model MMLU quickly (runs a subset), and Gemma 3n consistently loses to LLaMA 3.3 (~61% vs ~66%), and definitely loses to LLaMA 4 Scout (~86%). I suspect that means its rating on LMArena Leaderboard is just some form of gaming the metric.
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it is a better talker.
0. https://mmlu.borgcloud.ai/
1. https://trashtalk.borg.games/
> for everything from safeguarding
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
Imagine the entire internet is just an on-the-fly UI; that would be pretty cool.
My post politely describing how this blog post does not match Google's own app, running inference on a Pixel, is downvoted to -1, below dead posts with one-off short jokes.
I am posting again because I've been here 16 years now, it is very suspicious that happened, and given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge on this, who notes this won't work as advertised unless you're Google (i.e. internally, they have it binding to a privileged system process that can access the Tensor core, and this isn't available to third parties. Anyone else is getting 1/100th of the speeds in the post)
This post promises $150K in prizes for on-device multimodal apps and tells you it's running at up to 60 fps. They know it runs at 0.1 fps, engineering says that's because they haven't prioritized third parties yet, and somehow Google is getting away with this.
[flagged]
This is completely offtopic, but in case your question is genuine:
https://www.youtube.com/watch?v=F2X1pKEHIYw
> Why Some People Say SHTRONG (the CHRUTH), by Dr Geoff Lindsey