Comment by goobatrooba
3 hours ago
Either Google changed the text or you editorialised it a tiny bit - just for all others that got excited, they mean 16GB VRAM. So a premium graphics card requiring a >2500€ device is the minimum to run this.
Still progress, but not quite democratic yet.
Weird though that Google might be cannibalising its own AI subscription service?
I've bought a laptop for <1500€ that came with 32GB of RAM and an RTX 3080 with 16GB of VRAM. So I don't think a >2500€ device is necessary, though I'm certain it would yield better and faster results.
I haven't tried this model yet, but I can run Gemma 31B w/ the MTP drafter in pure CPU at about 10tok/s, so this should run at about 20-30tok/s on a decent CPU. It'll probably run at >50tok/s on any Mac that can fit it, and lots of people have a gaming GPU with enough VRAM. In terms of access to hardware being a gate, it's one you can hop pretty easily.
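For a rough sense of what "can fit it" means, here is a back-of-the-envelope sketch in Python. It counts only the weights (parameters × bits per weight ÷ 8) and ignores KV cache, activations, and runtime overhead, so treat it as a lower bound. The function names and the `headroom_gb` default are made up for illustration, and the 4-bit figure is an assumed quantization, not something stated in this thread.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus some headroom fit in the given memory.

    headroom_gb is a hypothetical allowance for KV cache and runtime
    overhead; real usage depends on context length and the backend.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + headroom_gb
    return needed <= memory_gb


if __name__ == "__main__":
    # A 31B-parameter model at an assumed 4 bits per weight.
    print(weight_memory_gb(31, 4))
    print(fits(31, 4, memory_gb=16))
    print(fits(31, 4, memory_gb=32))
```

By this estimate a 31B model at 4 bits needs about 15.5GB for weights alone, so it is tight on a 16GB card and comfortable in 32GB of system RAM, at the cost of speed.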
Could you outline how you are running the MTP drafters? I've tried LM Studio but no dice there. I'm probably missing something but I think llama.cpp and Ollama can't do it yet either?
I just built llama.cpp from scratch on the PR that has MTP drafters.
https://github.com/ggml-org/llama.cpp/pull/23398
Please don't use Ollama, it's a bad actor in the OSS community.
I haven't yet pushed the MTP enabled gemma4 12b model for Ollama because in my testing I wasn't getting a performance bump. The other gemma4 MTP models should work OK right now, but there are some fixes we're just about to push. This is specifically for the MLX backend.
Can't speak to compatibility with this new model, but oMLX supports MTP drafters very well.
Google is an advertising company first and foremost. At some point, these local models have to fit under that umbrella. I don't quite know how yet, but it's going to happen.
That being said, the real value in paid plans is that you get ecosystem integration that can read your gmail, photos, docs, and so on.