Comment by ImageXav
1 year ago
Edit: the comment below refers to Gemini, not Gemma. As such the first paragraph is largely irrelevant, and only the second one applies.
To me, it feels as though the boat has been missed somewhat. The restrictions on Gemini make it unhelpful, but more than that, Claude 3 has really blown me away with its code suggestions. It's performing better than Mistral Large, GPT-4 and Gemma in my tests, especially on large pieces of code. It also returns the complete code with the changes applied, making it much easier to plug and play. Astonishingly, it also manages to combine ideas better than any other LLM I've seen to date.
I suspect these fixes and the knowledge gained will nonetheless be helpful to the community, and will help improve the next iteration of models.
Claude 3 is very capable, but it is (likely) a 1T-class model, not something that can run at the edge, whereas 7B-class models can already run on phones and can easily be fine-tuned for specialized work where they perform comparably to the big general models.
If you are talking to a single model, by all means use the best one you have available. Personally, Claude not having a code interpreter (no way to self-evaluate its code) still makes it less useful than ChatGPT much of the time, or even than smaller open models like OpenCodeInterpreter: OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 with a code interpreter, on HumanEval+ and MBPP+ [1][2]. Recently I've been swapping between GPT-4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).
One issue with Gemma that doesn't get mentioned enough, IMO, is that while it claims to be 7B, it's really 8.54B parameters. It also has a gigantic tokenizer, so in terms of memory usage, even quantized, it is going to need significantly more than comparable 7B models (rough numbers in the sketch below the links). Once you are getting up to 9B you have other options as well: the new Yi-9B, or, if you want Apache-licensed (stacked Mistral) options, SOLAR-10.7B or the new bigstral-12b-32k.
[1] https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B
[2] https://evalplus.github.io/leaderboard.html
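To put rough numbers on the tokenizer/memory point above, here is a back-of-the-envelope sketch. It assumes the publicly reported config values (256K vocab and hidden size 3072 for Gemma-7B, 32K vocab and hidden size 4096 for Mistral-7B, total parameter counts of ~8.54B and ~7.24B); treat the exact figures as assumptions, not measurements.

```python
# Back-of-the-envelope comparison of embedding-table size and quantized
# weight memory. Config values below are assumptions taken from the public
# model cards, not measurements.

def embed_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in the token-embedding table (a tied lm_head adds no extra)."""
    return vocab_size * hidden_size

def quantized_gib(total_params: float, bits: int = 4) -> float:
    """Approximate weight memory in GiB at a given quantization width."""
    return total_params * bits / 8 / 2**30

gemma_embed = embed_params(256_000, 3072)    # ~0.79B params just for embeddings
mistral_embed = embed_params(32_000, 4096)   # ~0.13B params

print(f"Gemma embedding table:   {gemma_embed / 1e9:.2f}B params")
print(f"Mistral embedding table: {mistral_embed / 1e9:.2f}B params")
print(f"Gemma-7B   weights @ 4-bit: ~{quantized_gib(8.54e9):.1f} GiB")
print(f"Mistral-7B weights @ 4-bit: ~{quantized_gib(7.24e9):.1f} GiB")
```

That gap doesn't count the KV cache or activations, and many quantization setups keep the embedding/output tables at higher precision, which would widen it further.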
Yeah, the gigantic tokenizer does eat up a lot of VRAM. Gemma does use tied embeddings (i.e. lm_head == embeddings), which halves the VRAM needed to store that table, but it still requires more VRAM during training since you have to accumulate the gradients from both uses of it at the end.
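For anyone unfamiliar with the tied-embeddings point, here is a minimal PyTorch sketch of the idea (illustrative only, with toy sizes; not Gemma's actual implementation): the output projection reuses the embedding matrix, so the table is stored once, but during training that one tensor accumulates gradients from both of its uses.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Toy LM illustrating weight tying (lm_head shares the embedding matrix)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Tie the weights: one tensor serves as both the input embedding
        # and the output projection, so it is only stored once.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)   # (batch, seq, hidden)
        return self.lm_head(hidden)      # (batch, seq, vocab) logits

# Toy sizes so this runs anywhere; Gemma's real table is ~256K x 3072.
model = TinyTiedLM(vocab_size=1_000, hidden_size=64)
logits = model(torch.randint(0, 1_000, (1, 8)))
logits.sum().backward()
# The shared tensor receives gradient contributions from both the embedding
# lookup and the output projection; that accumulated gradient is the extra
# training-time VRAM cost mentioned above.
print(model.embed.weight.grad.shape)  # torch.Size([1000, 64])
```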
Why are you comparing Claude 3, a family of roughly 14B and >200B models, to Gemma, a 2-7B model? Of course it's going to do worse. The question for smol models is whether they can do well enough given a performance budget.
Does that give us more information about Gemma? The others are paywalled, best-in-class models with an order of magnitude higher parameter count.
It's possible that GP confused Gemma and Gemini.