
Comment by minimaxir

5 hours ago

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M-parameter layer, and I'm curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.

This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that Chameleon also produced multimodal output based on the same principles, but this model only takes multimodal input... (I'm curious how they did pre-training without having multimodal outputs as well. I wonder if they just chopped them off rather than support image output.)

  • Some of the FAIR people moved to Thinky, and they also started doing encoder-free MM-LLMs. Now Google. This seems to be becoming a trend that works at small scale, but the difficult part is scaling.

    The standard approach for training MM-LLMs is to train the encoder first. There are O(2-10B) good images on the internet, and the encoder needs to see each image O(10-100) times. That is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (smaller model, 2B active vs 30B or 200B active LLM); there is nothing magical about training the encoder and LLM together, it is just more token-efficient to train the image modality first.
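A back-of-the-envelope sketch of that token budget. The image counts and pass counts are the ranges quoted above; the tokens-per-image figure is not given in the thread, so `TOKENS_PER_IMAGE = 256` is purely a hypothetical value chosen for illustration.

```python
# Hypothetical: the thread does not say how many tokens one image costs.
TOKENS_PER_IMAGE = 256


def encoder_tokens(num_images: int, passes: int,
                   tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Total image tokens seen when every image is visited `passes` times."""
    return num_images * passes * tokens_per_image


# Low and high ends of the ranges quoted in the comment.
low = encoder_tokens(2_000_000_000, 10)
high = encoder_tokens(10_000_000_000, 100)

print(f"{low / 1e12:.1f}T to {high / 1e12:.0f}T tokens")
```

Under that assumption the upper end lands in the hundreds of trillions of tokens, which is the order of magnitude the comment is pointing at.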

  • I don't think it's the same. It's a similar concept, but Gemma is using just a linear projection, which I assume is a lot faster. The developer guide has more details: https://developers.googleblog.com/gemma-4-12b-the-developer-...

        Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48x48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input
    

    The "single matmul" is the key here. I haven't tried it, but it's probably pretty fast and memory efficient.
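A rough NumPy sketch of what the quoted description could look like: flatten 48x48 patches, project them with one matmul, add X and Y position lookups, then normalize. Only the patch size and the overall shape of the pipeline come from the guide. The RGB channel count, the tiny `D_MODEL`, the `MAX_GRID` limit, summing the two lookups, and the RMS-style normalization are all assumptions for illustration, not Gemma's actual implementation.

```python
import numpy as np

PATCH = 48      # patch side in pixels, from the quoted guide
CHANNELS = 3    # assumption: RGB input
D_MODEL = 64    # hypothetical hidden size, kept tiny so the sketch runs fast
MAX_GRID = 32   # hypothetical maximum number of patches per axis

rng = np.random.default_rng(0)
# The "single matmul": one weight matrix from flattened pixels to hidden size.
W = rng.standard_normal((PATCH * PATCH * CHANNELS, D_MODEL)) * 0.02
# Factorized coordinate lookup: one table per axis instead of one per (x, y).
pos_x = rng.standard_normal((MAX_GRID, D_MODEL)) * 0.02
pos_y = rng.standard_normal((MAX_GRID, D_MODEL)) * 0.02


def embed_image(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, C) image to (num_patches, D_MODEL) embeddings."""
    h, w, c = image.shape
    assert h % PATCH == 0 and w % PATCH == 0 and c == CHANNELS
    rows, cols = h // PATCH, w // PATCH
    # Cut into non-overlapping patches and flatten each one (row-major order).
    patches = image.reshape(rows, PATCH, cols, PATCH, c).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(rows * cols, PATCH * PATCH * c)
    tokens = flat @ W
    # Assumption: the X and Y lookups are simply added to the projection.
    ys, xs = np.divmod(np.arange(rows * cols), cols)
    tokens = tokens + pos_x[xs] + pos_y[ys]
    # Assumption: an RMS-style normalization stands in for "normalizations".
    rms = np.sqrt(np.mean(tokens ** 2, axis=-1, keepdims=True) + 1e-6)
    return tokens / rms


image = rng.random((96, 144, CHANNELS))  # 2 x 3 grid of patches
tokens = embed_image(image)
print(tokens.shape)  # (6, 64)
```

The point of the sketch is the cost profile: per patch, the work is one row of a matrix product plus two table lookups, with no attention layers in front of the LLM.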

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

  • In hindsight I may have been pedantic.

    • Not at all, I had the same feeling as you the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul vs a ViT encoder is probably a huge win.

    • Not at all. Getting really pedantic, tokenization is also a form of encoding, so no matter which modality you're using, you'll end up doing some type of encoding one way or another.


> quantization

12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?

But TBD how well the base model performs before thinking too much about quantization
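The arithmetic above, as a small sketch. It assumes exactly 12e9 parameters and counts weights only; KV cache, activations, and the extra scale data that real quantized formats store are ignored. `weight_gb` is a hypothetical helper, not part of any library.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight memory in GB (10^9 bytes), ignoring KV cache and runtime overhead."""
    return params_billion * bits_per_param / 8


for bits in (16, 8, 4):
    print(f"{bits:>2} bits/param: {weight_gb(12, bits):.0f} GB")
```

This gives 24 GB at 16 bits, 12 GB at 8 bits, and 6 GB at 4 bits, so a 16GB machine only works with some form of quantization.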

  • Smaller models are less forgiving of quantization. For a 12B model I wouldn't expect Q4 to be "pretty close", unless it underwent quantization-aware training (QAT). Of course it's not set in stone; there's huge variance between models, so this one might surprise.

The audio side is even more interesting, as it seems they totally got rid of the positional embedding and are just doing a single linear transform to match the LLM input dimension, and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

  • I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally and not worth mentioning.

    • Agree. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

    • Ah yeah, thinking further, it's probably just using some positional embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.

One side effect is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed when using the model with llama.cpp etc.

  • It's not? There's an mmproj in the GGUFs released by ggml-org: https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...

    From the visual guide, there's still the 35M parameter embedder, then the linear projector, for vision, and the linear projector for audio, so it does have some parameters used for the multimodal input to project it into the LLM latent space: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

    And the Unsloth quants, which are missing this, don't support multimodal input. (edit: actually, I may have just needed to update my llama.cpp, will check with an updated llama.cpp soon)

    I tried Unsloth first but got some weird problems, so I'm downloading the ggml-org GGUFs now and double-checking with the bf16 model to see if the issue was just the quant.

I would contend that the actual big story is the gallery app:

https://developers.google.com/edge/gallery

Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.

Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.

I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.

But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.

(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)

  • I had discounted Edge Gallery because it didn't support system prompts, but now it does so I will give it another go. I believe the implementation does use MTP since I got an update to Gemma-4-E4B on iOS indicating such, and on macOS it's very speedy.

    However, on my 18GB RAM MacBook Pro, selecting Gemma-4-12B-it results in this error:

    > The model "Gemma-4-12B-it' requires more memory (RAM) than is available on your device.

    So yeah, my questions about the 16GB marketing copy are fair.

    • Interesting; they may have fluffed up somewhere then.

      (Though perhaps it'll squeeze in with a small context window? Not sure I understand that aspect yet)

      It does seem to use MTP, yes, and it is quite quick — seemingly the underlying LiteRT stuff can do MTP with Gemma 4 and presumably MTP is a big part of the practicality picture here.

      The system prompt thing was a surprise when I poked around.

I don't think we've bottomed out on what we can do with embedding models. They're these tiny models that absolutely rip on modern cpus with 8 bit int optimizations. Like in my app we can say pretty definitive things about hundreds of millions of places in the world on retrieval tasks on regular hardware.

Either Google changed the text or you editorialised it a tiny bit - just for all others that got excited, they mean 16GB VRAM. So a premium graphics card requiring a >2500€ device is the minimum to run this.

Still progress, but not quite democratic yet.

Weird though that Google might be cannibalising its own AI subscription service?

  • I haven't tried this model yet, but I can run Gemma 31B w/ the MTP drafter in pure CPU at about 10tok/s so this should run at about 20-30tok/s on a decent CPU, it'll probably run at >50tok/s on any Mac that can fit it, and lots of people have a gaming GPU with enough VRAM. In terms of access to hardware being a gate, it's one you can hop pretty easily.

    • Could you outline how you are running the MTP drafters? I've tried LM Studio but no dice there. I'm probably missing something but I think llama.cpp and Ollama can't do it yet either?


I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.

Encoder-free is huge for running on SBCs etc. Often the encoding time is a significant fraction of generation time if you are using a VLM as an all-purpose vision model.

It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in its own way so YMMV, but overall it’s about as solid as Qwen, just with a more advanced architecture.

> That's technically encoding

Isn't that just projecting the patches into the d_model-sized vectors that the model takes?

> I am assuming that involves quantization

12B model in 16GB seems very reasonable to me, int8 is top quality for running models.

  • The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."

    12B at int8 would take up 12G of memory, or 75% of the system memory, which technically fits within 16GB but the OS will not like that. EDIT: On my 18G MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified` architecture is NYI.

    • Yeah and it’s pretty memory efficient with only 8 attention layers so at int8 in 16GB ram maybe you still get 64k-128k context.

      The part I hate though is that I’d bet none of the performance claims are based on int8.

      Why do we care about bf16 benchmarks when no one will be using that with this model?

  • I don’t think so, the HF weights are bf16 which means 24GB + cache/overhead.

    It sounds like marketing spin where the performance claims are based on BF16 and the “runs in 16GB” claim is on a totally different quantized version.