Fixing Gemma Bugs

1 year ago (unsloth.ai)

Something I've noticed with open-weight models is the rush to judgment as soon as they are released. But most people aren't actually running these models in full fp16 mode with the code supplied; they're using quantized versions with tip-of-tree patches to libraries like llama.cpp to get them running. And posts like this just show that it takes a bit for the software side of a model to get all the kinks worked out. We saw this with Mixtral (new architecture), CodeLlama-70b (new, very strict prompt format), and now Gemma.

In some ways it makes me so excited realizing how early this technology still is! There's going to be so much innovation and so many cool things built over the next several years, and so much new stuff to learn!

  • Oh yes that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision-based - ie it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
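
    For anyone who wants to see both effects, here's a tiny standalone sketch in plain PyTorch (not our actual kernels) - it compares exact vs tanh-approximate GELU, and shows how much resolution you lose just round-tripping values through bfloat16 / float16:

        import torch

        x = torch.linspace(-4, 4, 10_000, dtype=torch.float32)

        # Exact GELU uses erf; the "approximate" variant uses a tanh expansion.
        exact = torch.nn.functional.gelu(x, approximate="none")
        approx = torch.nn.functional.gelu(x, approximate="tanh")
        print("max |exact - tanh| GELU gap:", (exact - approx).abs().max().item())

        # Precision: round-trip the same values through half-precision formats.
        for dtype in (torch.bfloat16, torch.float16):
            lossy = exact.to(dtype).to(torch.float32)
            print(dtype, "max round-trip error:", (lossy - exact).abs().max().item())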

    • On a related point, there was a post on r/LocalLLaMA a short while ago claiming that quantization impacts performance more than people think:

      https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...

      The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity mostly measures whether the output looks correct; it doesn't measure performance (roughly equivalent to "intelligence") when you need more nuance.

      I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to the performance of the models.

      So I'd be very skeptical of attempts to evaluate the performance (i.e. intelligence level) of models with anything other than the stack (preferably down to the hardware) recommended by the party that released the model.
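
      To be concrete about the metric being discussed (my own minimal sketch, not anything from that thread): perplexity is just the exponential of the average per-token cross-entropy, so it only rewards assigning reasonable probability to the reference text -

          import math

          # Hypothetical per-token log-probabilities a model assigned to some
          # reference continuation (higher = the text "looks right" to the model).
          token_logprobs = [-0.9, -1.4, -0.3, -2.1, -0.7]

          avg_nll = -sum(token_logprobs) / len(token_logprobs)  # mean cross-entropy
          perplexity = math.exp(avg_nll)
          print(f"perplexity = {perplexity:.2f}")

      Two models can land on near-identical averages here while still differing a lot on tasks that need multi-step reasoning, which is the nuance I mean.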

This gave me lots of confidence in Unsloth when I first read it.

I'll admit I was a little skeptical of Unsloth, since anything that boasts a free performance improvement just by dropping in some middleware makes me suspicious - especially from such a small team.

I assumed it was just introducing some hacks that create an inexact implementation of attention, or some faster-but-inaccurate CUDA kernels, or something.

But now I believe this small team really knows their stuff :)

Incredible work by the author stepping through all the nitty-gritty details and showing how easy it is to miss something subtle that could degrade performance.

Really clean usage of Colab btw. I just had to click a single button and everything ran.

Good job, will join the Discord!

  • I was just thinking how terrible the website is because it doesn't degrade gracefully. There's no information if you don't successfully execute all the scripts associated with the page - just a blank white page with nothing. For a blog post with text and images this is really bad. The text and images should be there in the HTML, and then the dynamic elements loaded on top.

    Even when I loaded it in the browser I use for banks, etc., I still got errors and the JS didn't run quite right - I got "NameError: name 'torch' is not defined", "NameError: name 'FastLanguageModel' is not defined", etc.

    • Oh ye you'll have to click "Runtime" -> "Run All". I think you probably forgot to execute the installation cell.

      Apologies on the website rendering issues :( I normally find Colab to be reasonably responsive, so presumably the Javascript components are breaking :( Much apologies :(

    • It sounds like you are simply not familiar with how Colab works; this has nothing to do with the original work.

  • Oh thanks! I love Colab since it provides a free GPU + you can run the code + you can write a blog post inside it :)

Is there a way to read this without a Google login?

Wow, colab.research.google.com - that's a terrible domain name for hosting Google-embarrassing user-generated content.

Edit: the comment below refers to Gemini, not Gemma. As such the first paragraph is largely irrelevant, and only the second one applies.

To me, it feels as though the boat has been missed somewhat. The restrictions on Gemini make it unhelpful, but more than that, Claude 3 has really blown me away with its code suggestions. It's performing better than Mistral Large, GPT4 and Gemma in my tests, especially for large bits of code. It also returns the whole hog with changes, making it much easier to plug and play. Astonishingly, it also manages to combine ideas much better than any other LLM I've seen to date.

I suspect these fixes and the knowledge gained will be helpful to the community, however, and will help improve the next iteration of models.

  • Claude 3 is very capable, but it is (likely) a 1T class model, not something that can be run on the edge, while 7B class models can already be run on phones and can be easily fine-tuned to do specialized work that can perform comparably to those big general models.

    If you are talking to one model, by all means use the best one you have available. (Personally, Claude not having a code interpreter / not being able to self-evaluate code still makes it oftentimes less useful than ChatGPT - or even smaller open models like OpenCodeInterpreter: OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 w/ CI, on HumanEval+ and MBPP+ [1][2].) Recently I've been swapping between GPT-4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).

    One issue with Gemma that doesn't get mentioned enough IMO is that while it claims to be 7B, it's really 8.54B parameters (rough arithmetic at the end of this comment). It also has a gigantic tokenizer, so memory-usage-wise, even quantized it is going to be significantly more than comparable 7B models. Once you are getting to 9B, you have other options - the new Yi-9B, or if you want Apache-licensed (stacked Mistral) models, you can use SOLAR-10.7B or the new bigstral-12b-32k.

    [1] https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B

    [2] https://evalplus.github.io/leaderboard.html
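
    For the curious, here's the back-of-the-envelope arithmetic behind that 8.54B figure, using the values from google/gemma-7b's config.json (those config numbers are the only inputs, so treat it as a sanity check rather than an official count):

        # google/gemma-7b config values
        vocab_size, hidden = 256_000, 3072
        layers, heads, head_dim, inter = 28, 16, 256, 24_576

        embed = vocab_size * hidden              # tied with lm_head, so counted once
        attn = 4 * hidden * heads * head_dim     # q, k, v, o projections
        mlp = 3 * hidden * inter                 # gate, up, down projections
        per_layer = attn + mlp + 2 * hidden      # plus the two RMSNorm weights
        total = embed + layers * per_layer + hidden  # plus the final norm

        print(f"total: {total / 1e9:.2f}B params")       # ~8.54B
        print(f"embeddings alone: {embed / 1e6:.0f}M params")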

    • Ye the gigantic tokenizer does eat up VRAM a lot. Gemma does use tied embeddings (ie lm_head == embeddings), which makes the weights themselves take 50% less VRAM, but training still needs more VRAM since the gradients from both uses have to be added up into the same matrix at the end.
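
      A rough illustration of what that tying means in plain PyTorch (just a sketch, not our actual code) - the input embedding and the output projection share one tensor, so it's stored once but its .grad collects contributions from both ends:

          import torch
          import torch.nn as nn

          # Tiny sizes for the demo; Gemma's real shape is 256000 x 3072.
          vocab, hidden = 1_000, 64
          embed = nn.Embedding(vocab, hidden)
          lm_head = nn.Linear(hidden, vocab, bias=False)
          lm_head.weight = embed.weight        # tying: one shared [vocab, hidden] matrix

          ids = torch.randint(0, vocab, (2, 8))
          logits = lm_head(embed(ids))         # embed(ids) stands in for the transformer body
          logits.sum().backward()

          print(embed.weight.data_ptr() == lm_head.weight.data_ptr())  # True: stored once
          print(embed.weight.grad.shape)       # one grad, accumulated from both uses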

  • Why are you comparing Claude 3, a family of ~14B and ~>200B models, to Gemma, a 2-7B model? Of course it's going to do worse. The question for smol models is whether they can do well enough given a performance budget.

  • Does that give us more information about Gemma? The others are paywalled, best-in-class models with an order of magnitude more parameters.

Does anyone know if the major-dealbreaker “Additional Terms” apply to Gemma? Because I don’t want to touch anything Google-related with a 100-foot pole, given the following:

> Use restrictions You may not use the Services to develop machine learning models or related technology.

https://policies.google.com/terms/generative-ai

Note that using the Gemini chat model develops it, so, taken extremely literally, this is a blanket ban on sending text to Gemini.

  • Law tends to go by the plain-English meaning. E.g. here, you understand that the idea isn't to ban people from interacting with Gemini, but rather to stop them from using it to develop new models (i.e. using its outputs as inputs for training another model).

    • Hmm, I took it to mean you couldn’t even ask about ML and you must avert your eyes when SGE pops up on ML queries on Google.

      Anyway, Google lost me as a customer over that, so I promise not to help them “develop their models”!