Comment by lhl
1 year ago
Claude 3 is very capable, but it is (likely) a 1T-class model, not something that can be run on the edge, while 7B-class models can already run on phones and can easily be fine-tuned for specialized work where they perform comparably to those big general models.
If you are talking to one model, by all means use the best one you have available (personally, Claude not having a code interpreter / not being able to self-evaluate code still often makes it less useful than ChatGPT, or even smaller open models like OpenCodeInterpreter - OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 w/ CI, on HumanEval+ and MBPP+ [1][2]). Recently I've been swapping between GPT-4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).
One issue with Gemma that doesn't get mentioned enough IMO is that while it claims to be 7B, it's really 8.54B parameters. It also has a gigantic tokenizer, so memory-usage-wise, even quantized it is going to take significantly more than comparable 7B models (see the rough sketch below). Once you are getting to 9B, you have other options - the new Yi-9B, or if you want Apache-licensed (stacked Mistral) options, you can use SOLAR-10.7B or the new bigstral-12b-32k.
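As a rough back-of-the-envelope sketch of where the extra parameters go (the config values below are assumptions from the published model configs as I recall them - Gemma-7B: ~256k vocab, 3072 hidden size; Mistral-7B: 32k vocab, 4096 hidden size), the embedding table alone is roughly 6x larger:

```python
# Rough comparison of embedding-table size for Gemma-7B vs Mistral-7B.
# Config values are assumptions taken from the published model configs:
#   Gemma-7B:   vocab ~256,000, hidden size 3072
#   Mistral-7B: vocab   32,000, hidden size 4096

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single token-embedding matrix."""
    return vocab_size * hidden_size

gemma = embedding_params(256_000, 3072)    # ~786M parameters
mistral = embedding_params(32_000, 4096)   # ~131M parameters

# At fp16 (2 bytes per parameter) the embedding table alone is:
print(f"Gemma-7B embeddings:   {gemma * 2 / 1e9:.2f} GB")    # ~1.57 GB
print(f"Mistral-7B embeddings: {mistral * 2 / 1e9:.2f} GB")   # ~0.26 GB
```

Quantization helps with the transformer blocks, but that embedding table is a fixed chunk of memory that comparable 7B models simply don't carry.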
Yeah, the gigantic tokenizer does eat up a lot of VRAM. Gemma does use tied embeddings (i.e. lm_head == embeddings), which makes the embedding weights take 50% less space, but it still requires more VRAM during training since you have to add the gradients from both uses up at the end.
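A minimal PyTorch sketch of what weight tying means here (toy dimensions purely for illustration, not Gemma's real sizes): the input embedding and lm_head share one weight tensor, so the parameters are stored once, but during backprop the gradients from both the embedding lookup and the output projection accumulate into that same tensor - which is the extra training-time memory being referred to:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only (not Gemma's actual config).
vocab_size, hidden_size = 1000, 64

embed = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Weight tying: lm_head reuses the embedding matrix, so the weights
# are stored only once (half the space of two separate matrices).
lm_head.weight = embed.weight

tokens = torch.randint(0, vocab_size, (4, 16))
hidden = embed(tokens)           # shared weight used on the way in...
logits = lm_head(hidden)         # ...and again on the way out
loss = logits.float().log_softmax(-1).mean()  # dummy loss for the demo
loss.backward()

# Both uses of the shared tensor contribute gradients, summed into one
# .grad buffer - that accumulation is the training-time cost that remains
# even though the weights themselves are only stored once.
print(embed.weight.grad.shape)   # torch.Size([1000, 64])
assert embed.weight.grad is lm_head.weight.grad
```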