Something I've noticed with open weight models is the rush to judgment as soon as they are released. But most people aren't actually running these models in full fp16 mode with the code supplied; they're using quantized versions with tip-of-tree patches to libraries like llama.cpp to get them running. And posts like this just show that it takes a bit for the software side of a model to get all the kinks worked out. We saw this with Mixtral (new architecture), CodeLlama-70b (new, very strict prompt format), and now Gemma.
In some ways it makes me so excited realizing how early this technology still is! There's going to be so much innovation and so many cool things built over the next several years, and so much new stuff to learn!
Oh yes that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision based - ie it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
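To make that concrete, here is a rough standalone sketch (plain PyTorch, not Gemma's actual code) of both effects: the erf-based "exact" GELU vs the tanh approximation, and the extra error introduced just by round-tripping the activations through bfloat16:

```python
import torch

# Illustrative only: compare exact vs tanh-approximate GELU, and the error
# added by storing the inputs in bfloat16 before applying the exact GELU.
x = torch.randn(4096, dtype=torch.float32) * 8.0

gelu_exact = torch.nn.functional.gelu(x)                      # erf-based "exact" GELU
gelu_tanh = torch.nn.functional.gelu(x, approximate="tanh")   # tanh approximation

# Round-trip the inputs through bfloat16 to simulate low-precision storage.
x_bf16 = x.to(torch.bfloat16).float()
gelu_from_bf16 = torch.nn.functional.gelu(x_bf16)

print("exact vs tanh (float32):   ", (gelu_exact - gelu_tanh).abs().max().item())
print("float32 vs bfloat16 inputs:", (gelu_exact - gelu_from_bf16).abs().max().item())
```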
On a related point, there was a post on r/LocalLLaMA a short while ago claiming that quantization impacts perf more than people think:
https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...
The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity tends to measure whether the output looks correct; it doesn't measure performance (roughly equivalent to "intelligence") when you need more nuance.
I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to the performance of the models.
So I'd be very skeptical of people trying to evaluate the performance (i.e. intelligence level) of models on anything other than the stack (preferably down to the hardware) suggested by the party that released the model.
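For reference, this is roughly how perplexity is computed from per-token log-probabilities (a toy sketch with random stand-in tensors, not any particular library's implementation). Because it averages over every token, a quantized model that only stumbles on the few "hard" tokens can still score nearly the same as the full-precision model:

```python
import math
import torch

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (seq_len, vocab_size), targets: (seq_len,)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Mean negative log-likelihood of the target token at each position.
    nll = -log_probs[torch.arange(targets.numel()), targets].mean()
    return math.exp(nll.item())

# Toy usage with random numbers standing in for real model outputs.
logits = torch.randn(16, 32000)
targets = torch.randint(0, 32000, (16,))
print(perplexity(logits, targets))
```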
Gemma.cpp was also affected, and now fixed (https://github.com/google/gemma.cpp/pull/93). Thanks for the heads-up!
This gave me lots of confidence in Unsloth when I first read it.
I'll admit I was a little skeptical of Unsloth, since anything that boasts free perf improvement, just by dropping in some middleware, makes me suspicious. Especially from such a small team.
I assumed it was just introducing some hacks that create an inexact implementation of attention, or some faster-but-inaccurate CUDA kernels or something.
But now I believe this small team really knows their stuff :)
I know the founder personally - he interned at Nvidia and contributed many performance improvements. He's the real deal - just really enthusiastic, so it may come off as boastfulness ;)
Apologies on the enthusiasm!! :) And hi!!
They’ve had their work applauded by Karpathy and Jeremy P. Howard as well, which are about the best credentials you could ever get for open source AI stuff:
https://twitter.com/karpathy/status/1765473722985771335
I’ve been using the library since it started out and it works really well. Daniel is also super helpful and responsive in their Discord, helping with everything from the most basic user questions to breaking down complex ML math.
Thanks to Andrej and Jeremy as well :) And also thanks to community members like you! It makes me super happy to keep making Unsloth better so appreciate it a lot!
real recognizing real
The article mentioned in the comment: https://unsloth.ai/blog/gemma-bugs
Whoops I think I might have forgotten to add it to the Colab!!
Oh thanks! I get that a lot :) But ye there's no approximations at all! Just special maths hacks with no degradations, rewriting everything and creating a custom backprop engine, sprinkling Triton / CUDA everywhere and more :)
But thanks for believing in me + my bro even more :) Appreciate it a lot!
Incredible work by the author stepping through all the nitty-gritty details and showing how easy it is to miss something subtle that could degrade performance.
Thanks! :) I'm pushing them into transformers, pytorch-gemma and collabing with the Gemma team to resolve all the issues :)
The RoPE fix should already be in transformers 4.38.2: https://github.com/huggingface/transformers/pull/29285
My main PR for transformers which fixes most of the issues (some still left): https://github.com/huggingface/transformers/pull/29402
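For anyone curious why the RoPE dtype matters, here is an illustrative sketch (not the transformers code itself) showing how building the rotary-embedding position/frequency grid in bfloat16 rounds large position indices, while building it in float32 and casting at the end keeps them exact:

```python
import torch

# bfloat16 has only ~8 mantissa bits, so large position indices get rounded
# before the cos/sin tables are even computed. Building the grid in float32
# avoids that.
dim, base = 256, 10000.0
positions = torch.arange(0, 8192)
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

angles_fp32 = positions.float()[:, None] * inv_freq[None, :]
angles_bf16 = positions.to(torch.bfloat16)[:, None] * inv_freq.to(torch.bfloat16)[None, :]

print("max |cos difference|:",
      (angles_fp32.cos() - angles_bf16.float().cos()).abs().max().item())
```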
Incredible indeed! Just hunting down one of these bugs feels like a very time consuming endeavor.
What's your approach for these more subtle numerical bugs?
The post got edited, so here's the Colab which highlights all the fixes + lets you run Gemma finetuning 2.5x faster: https://colab.research.google.com/drive/1fxDWAfPIbC-bHwDSVj5...
This is why open source can be net beneficial for companies too, helping them spot bugs easily and improve their own tools
Open source for the win!
Really clean usage of Colab btw. I just had to click a single button and everything ran.
Good job, will join the Discord!
I was just thinking how terrible the website is because it doesn't gracefully degrade. There's no information if you don't successfully execute all the applications associated with the page - just a blank white page with nothing. For a blog post with text and images this is really bad. The text and images should be there in the HTML, with the dynamic elements loaded on top.
Even when I load it in the browser I use for banking etc., I still get errors; the JS doesn't run quite right and I get "NameError: name 'torch' is not defined", "NameError: name 'FastLanguageModel' is not defined", etc.
Oh ye you'll have to click "Runtime" -> "Run All". I think you probably forgot to execute the installation cell.
Apologies on the website rendering issues :( I normally find Colab to be reasonably responsive, so presumably the Javascript components are breaking :( Much apologies :(
It sounds like you are simply not familiar with how Colab works; this has nothing to do with the original work.
??
Just click the `Run All` button. There's no variance; you will always get the same output.
Oh thanks! I love Colab since it provides a free GPU + you can run the code + you can write a blog post inside it :)
Substantial improvements. Thanks for sharing them.
Thanks! :)
Here are the missing links:
* Gemma, a family of open models from Google: https://ai.google.dev/gemma
* Unsloth is a tool for training models faster (IIUC): https://github.com/unslothai/unsloth
Oops apologies on the links! I had to go to sleep - I should have added them maybe as a comment - but glad you did! Thanks!
Is there a way to read this without a Google login ?
You can try our blog post if that works :) https://unsloth.ai/blog/gemma-bugs Also our Twitter feed https://twitter.com/danielhanchen/status/1765446273661075609 which lists all the bugs as well - apologies on the issue!
Thanks for the tip!
If you load it in a private tab it doesn't ask, but it does ask if I use a browser session where I'm already logged in to Google (I've never loaded Colab in this particular account).
I didn't need a Google login to read it
> To help keep your account secure, Google needs to verify it’s you. Please sign in again to continue.
Wow, colab.research.google.com - that's a terrible domain name for hosting Google-embarrassing user-generated content.
Edit: the comment below refers to Gemini, not Gemma. As such the first paragraph is largely irrelevant, and only the second one applies.
To me, it feels as though the boat has been missed somewhat. The restrictions on Gemini make it unhelpful, but more than that, Claude 3 has really blown me away with its code suggestions. It's performing better than Mistral Large, GPT-4 and Gemma in my tests, especially for large pieces of code. It also returns the whole thing with changes applied, making it much easier to plug and play. Astonishingly, it also manages to combine ideas much better than any other LLM I've seen to date.
I suspect these fixes and the knowledge gained will be helpful to the community however, and will help improve the next iteration of models.
Claude 3 is very capable, but it is (likely) a 1T class model, not something that can be run on the edge, while 7B class models can already be run on phones and can be easily fine-tuned to do specialized work that can perform comparably to those big general models.
If you are talking to one model, by all means use the best one you have available (personally, Claude not having a code interpreter / the ability to self-evaluate code still makes it oftentimes less useful than ChatGPT, or even smaller open models like OpenCodeInterpreter - OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 w/ CI, on HumanEval+ and MBPP+ [1][2]). Recently I've been swapping between GPT-4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).
One issue with Gemma that doesn't get mentioned enough IMO is that while it claims to be 7B, it's really 8.54B parameters. It also has a gigantic tokenizer, so memory-wise, even quantized, it is going to take significantly more than comparable 7B models. Once you are getting to 9B, you have other options, such as the new Yi-9B, or if you want Apache-licensed (stacked Mistral), SOLAR-10.7B or the new bigstral-12b-32k.
[1] https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B
[2] https://evalplus.github.io/leaderboard.html
Ye the gigantic tokenizer does eat up VRAM a lot. Although Gemma uses tied embeddings (ie lm_head == embeddings), which does make it use 50% less VRAM in terms of space, it still requires more VRAM since you have to add the gradients up at the end.
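For readers unfamiliar with the term, here's a rough sketch of what tied embeddings mean (toy sizes and illustrative module names, not Gemma's actual classes; Gemma's real vocabulary is around 256K, which is why its embedding matrix is so big):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=32_000, hidden=1024):  # toy sizes, not Gemma's
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: lm_head == embeddings

    def forward(self, ids):
        return self.lm_head(self.embed(ids))

model = TiedLM()
# The shared matrix is stored once, halving the space these two layers need,
# but during training its gradient accumulates contributions from both uses.
print(sum(p.numel() for p in model.parameters()))  # 32_000 * 1024, not 2x that
```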
Why are you comparing Claude 3, a ~14B to >200B model, to Gemma, a 2-7B model? Of course it's going to do worse. The question for smol models is whether they can do well enough given a performance budget.
Does that give us more information about Gemma? The others are paywalled, best-in-class models with an order of magnitude higher parameter count.
It's possible that GP confused Gemma and Gemini.
Does anyone know if the major dealbreaker “Additional Terms” apply to Gemma? Because I don’t want to touch anything Google-related with a 100-foot pole given the following:
> Use restrictions
> You may not use the Services to develop machine learning models or related technology.
https://policies.google.com/terms/generative-ai
Note that using the Gemini chat model develops it, so, taken extremely seriously, this is a blanket ban on sending text to Gemini.
Law tends to go by plain English meaning; e.g. here, you understand that the idea isn't to ban people from interacting with Gemini, but rather to stop them from using it to develop new models (i.e. using its outputs as inputs for training another model).
Hmm, I took it to mean you couldn’t even ask about ML and you must avert your eyes when SGE pops up on ML queries on Google.
Anyway Google lost me as a customer for that so I promise not to help them “develop their models!”
https://ai.google.dev/gemma/terms https://ai.google.dev/gemma/prohibited_use_policy
Looks like there are no such restrictions? And these terms do not reference the additional terms.
The clause I find problematic is "Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma." It's a bit vague - I guess it's not enforceable!