Comment by canyon289

4 days ago

Hey all, I created this model with a top notch team. I answered many questions last week when this hit the front page, and I'm happy to answer more here as well.

https://news.ycombinator.com/item?id=44902148

Personally I'm excited that you all have access to this model now and hope you get value out of using it.

I would like to know your thoughts on using 2/3 of such a small model's parameter budget for embeddings. What would be different if you used a byte-level vocabulary and spent the parameter budget on transformer parameters instead? I think you would lose performance (tok/s) but might gain accuracy.

  • At this small scale the embeddings indeed were a big focus. Consider this thought process.

    The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword vocabulary it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.

    So while we could get more intelligence out of the model by bumping up the "knowledge" parameters, the device would need to process more input and output tokens.

    Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass.

    This blog post explains it well. https://www.adamcasson.com/posts/transformer-flops

    So all this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.
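    To make the lookup-versus-matmul point concrete, here is a rough PyTorch sketch. The sizes are only ballpark assumptions for a 270M-class model, not the exact Gemma 3 270M configuration.

        import torch
        import torch.nn as nn

        # Ballpark sizes for a 270M-class model (assumptions, not the exact config).
        vocab_size, d_model = 262_144, 640

        embed = nn.Embedding(vocab_size, d_model)   # big parameter count, but just a lookup table
        proj  = nn.Linear(d_model, d_model)         # one of many matmuls in each transformer layer

        token_ids = torch.randint(0, vocab_size, (1, 8))

        # The embedding forward pass is an index into the table: essentially zero
        # matmul FLOPs, no matter how large the vocabulary gets.
        x = embed(token_ids)                        # shape (1, 8, d_model)

        # A dense layer costs roughly 2 * d_in * d_out FLOPs per token, so every
        # parameter moved from the embedding table into transformer weights adds
        # compute to every forward pass (see the FLOPs blog post linked above).
        flops_per_token = 2 * proj.in_features * proj.out_features
        print(f"embedding params: {embed.weight.numel():,} (lookup only)")
        print(f"one linear layer: {flops_per_token:,} FLOPs per token")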

    • How would this matrix get trained with PyTorch? I currently have a toy Transformer network - I ended up marking the matrix as sparse and using SparseAdam, which gives a bit of a performance boost, but at the same time I can't use torch.compile() on the fetch from this matrix. (A minimal sketch of that kind of setup appears a bit further down.)

    • Does Gemma use any specific scheme to compress embeddings? Which have you considered?

      For instance, it's well known that transformer embeddings tend to form clusters. Have you considered splitting the embedding table into "cluster centroid" and "offset from centroid" tables, where the latter would presumably have a smaller range and lower precision?
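      A rough sketch of that centroid-plus-offset idea, purely as an illustration of the question (nothing here is a scheme Gemma is known to use, and all sizes are made up):

          import torch

          vocab, d, k = 32_000, 256, 512
          E = torch.randn(vocab, d)  # stand-in for a trained embedding table

          # A few crude Lloyd (k-means) iterations to get k centroids.
          centroids = E[torch.randperm(vocab)[:k]].clone()
          for _ in range(5):
              assign = torch.cdist(E, centroids).argmin(dim=1)   # nearest centroid per row
              for c in range(k):
                  members = E[assign == c]
                  if len(members):
                      centroids[c] = members.mean(dim=0)

          # Offsets from the centroid have a much smaller range, so int8 with a
          # single scale is plausible for them.
          residual = E - centroids[assign]
          scale = residual.abs().max() / 127
          residual_q = torch.clamp((residual / scale).round(), -127, 127).to(torch.int8)

          # Reconstruction: centroid lookup plus dequantized offset.
          E_hat = centroids[assign] + residual_q.float() * scale
          print("mean abs reconstruction error:", (E - E_hat).abs().mean().item())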
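      And going back to the earlier question about training the embedding matrix with SparseAdam: a minimal sketch of that split-optimizer setup, with toy sizes and a made-up module layout (assumptions, not the actual toy network described above):

          import torch
          import torch.nn as nn

          class ToyLM(nn.Module):
              def __init__(self, vocab=32_000, d=256):
                  super().__init__()
                  self.embed = nn.Embedding(vocab, d, sparse=True)  # produces sparse gradients
                  self.body = nn.Linear(d, d)
                  self.head = nn.Linear(d, vocab)

              def forward(self, ids):
                  return self.head(torch.tanh(self.body(self.embed(ids))))

          model = ToyLM()

          # SparseAdam handles only the sparse-gradient embedding; everything else gets AdamW.
          sparse_opt = torch.optim.SparseAdam(model.embed.parameters(), lr=1e-3)
          dense_opt = torch.optim.AdamW(
              [p for n, p in model.named_parameters() if not n.startswith("embed")], lr=1e-3
          )

          ids = torch.randint(0, 32_000, (4, 16))
          # Toy objective: predict the input id back, just to produce gradients.
          loss = nn.functional.cross_entropy(model(ids).flatten(0, 1), ids.flatten())
          loss.backward()
          sparse_opt.step(); dense_opt.step()
          model.zero_grad(set_to_none=True)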

Very stupid question: why does the tflite model output only '[multimodal][multimodal]' when executed on the GPU in the AI Edge Gallery app, while working fully on the CPU?

Thanks for your work, it is really an amazing small LM.

Can you share what kind of hardware is necessary to train it, and how long it took?

  • Thank you!

    The Gemma3 technical report contains many details on training setup https://arxiv.org/pdf/2503.19786

    This was released with the initial batch of Gemma3, so it doesn't contain the 270M details; nonetheless you'll get a good idea of what it takes to build these models.

As a non MLE, what are the pros/cons of OP's PyTorch re-implementation?

  • It is extremely valuable for researchers that commonly prototype theories using PyTorch on less powerful devices. Many of my colleagues run theory experiments using GPT-2 models. This allows for an easy transition to testing on a SOTA model instead.

  • I'm not an ML engineer, so I can speak to the "non MLE" bit from my perspective

    (literal tl;dr: learning and experimentation opportunity)

    1. Since it's just PyTorch, that means one can run it locally on whatever accelerator you have that PyTorch supports. For quite a few people that includes Metal Performance Shaders: https://docs.pytorch.org/docs/stable/mps.html (a device-selection sketch appears after this list)

    I can attest that building PyTorch from git is achievable in about 15 minutes on my M1 Pro, if you really want to chase that rabbit hole. Cloning PyTorch is its own special 'please. wait.', but building it is fine.

    2. Since it's (of the ones that I've looked at) approximately 500 lines long, it's much, much, much more digestible than a lot of the vomit that comes out of so-called production systems. Those systems usually have only heard about typed Python in passing, and they believe it is a fad that will blow over. The ones in this repo aren't stellar about it, but at 500 lines it's easily achievable to type hint the code yourself, which can serve as an excellent learning opportunity

    3. PyTorch offers some fun conversion tools, also, allowing one to compare-and-contrast how it executes under Torch versus ONNX <https://docs.pytorch.org/docs/stable/onnx.html>, TorchScript <https://docs.pytorch.org/docs/stable/generated/torch.jit.sav...>, CoreML <https://apple.github.io/coremltools/docs-guides/source/conve...>, or a bazillion other competing frameworks

    4. Related, one can play around with quantization and other "inference related" concerns (e.g. https://github.com/pytorch/ao#pytorch-native-training-to-ser... ); a small quantization and export sketch follows this list

    5. Further related, one can play around with the fine-tuning mentioned elsewhere, to better understand what is and isn't possible to achieve using that process. Because the code is digestible, and the models are reasonably sized (Qwen 0.6B weighs only 1.4GB and is Apache 2), it brings FAFO opportunities in ways that gpt-oss-20b (or bigger!) won't

    I do appreciate that some of what I said may skate close to "ML engineer" concerns, so obviously your situation will be different. But for me, having a better grip on how these things work enables me to have better conversations with my colleagues, and it also helps trip my bullshit detector when someone claims they're the second coming and is going to cure cancer or whatever
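    For point 1, here's the kind of device-selection boilerplate I mean, with a stand-in module (the real thing would be one of the repo's model classes, which I'm not assuming anything about here):

        import torch

        # Pick whatever accelerator this PyTorch build supports; "mps" covers Apple silicon.
        device = (
            "cuda" if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available()
            else "cpu"
        )

        # Tiny stand-in module; the point is just that plain PyTorch modules and
        # tensors move between devices with .to().
        model = torch.nn.Linear(8, 8).to(device)
        x = torch.randn(1, 8, device=device)
        print(device, model(x).shape)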
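    And for points 3 and 4, a low-friction sketch of quantization plus ONNX export, again on a stand-in module rather than the actual Gemma class (torchao's weight-only schemes are the more modern route; dynamic quantization is just the easiest thing to poke at):

        import torch
        import torch.nn as nn

        # Stand-in module; in practice you'd load the repo's model class instead.
        model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).eval()
        x = torch.randn(1, 256)

        # Post-training dynamic quantization of the Linear layers to int8 (CPU path).
        qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
        print(qmodel(x).shape)

        # The same eager module can also be exported for the ONNX comparison in point 3.
        torch.onnx.export(model, (x,), "toy.onnx")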

Thanks for making this! One of my favorite projects was having a Discord chatbot powered by the original BERT model - these 270M weights are a fine upgrade.

Does it have function calls? Can we use it with MCP?

  • It can possibly perform basic prompted FC, but I wouldn't get your hopes up. It should be able to be a solid FC model if trained on specific tools and formats. I would not expect great MCP performance, because the context window is 32k and most MCP servers I've seen implicitly assume massive context windows.
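    For reference, a minimal sketch of what "prompted" function calling can look like. The tool schema and the expected JSON reply format are assumptions made up for illustration, not anything Gemma 3 270M ships with, and the model output is faked here:

        import json, re

        # Hypothetical tool definition, injected into the prompt as plain text.
        TOOLS = [{
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {"city": "string"},
        }]

        def build_prompt(user_msg: str) -> str:
            return (
                "You can call exactly one tool. Reply ONLY with JSON of the form "
                '{"tool": <name>, "arguments": {...}}.\n'
                f"Tools: {json.dumps(TOOLS)}\n"
                f"User: {user_msg}\n"
                "Assistant:"
            )

        def parse_call(completion: str):
            # Small models drift from the format, so parse defensively.
            match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
            if not match:
                return None
            try:
                call = json.loads(match.group(0))
            except json.JSONDecodeError:
                return None
            return call if call.get("tool") in {t["name"] for t in TOOLS} else None

        prompt = build_prompt("What's the weather in Oslo?")
        # Stand-in for model output; in practice this string comes from the model.
        fake_completion = '{"tool": "get_weather", "arguments": {"city": "Oslo"}}'
        print(parse_call(fake_completion))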