Comment by carbocation
2 years ago
I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all because Gemini Turbo isn't available in Bard yet.
I am not sure what Gemini Turbo is. Perhaps you meant Gemini Ultra? Because Gemini Pro (which is in this table) is currently accessible in Bard.
Yes, that's what I meant.
I get what you mean, but what would such a "qualitative evaluation" look like?
I think my ideal might be as simple as a few people who spend a lot of time with various models describing their experiences in separate blog posts.
I see.
I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have had an amazing experience with these two models:
- https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B (general usage)
- https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instr... (coding)
OpenChat 3.5 is also very good for general usage, but IMO NeuralHermes surpasses it significantly, so I switched a few days ago.
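For anyone curious what "running small LLMs locally" looks like in practice, here is a minimal sketch using the Hugging Face transformers library with the NeuralHermes model linked above. It assumes you have transformers, accelerate, and torch installed and enough memory for a 7B model in fp16; the plain-text prompt also skips the model's chat template for brevity, so real usage would format the prompt accordingly.

    # Minimal local-inference sketch using Hugging Face transformers.
    # Assumes enough GPU/CPU memory for a 7B model loaded in fp16.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mlabonne/NeuralHermes-2.5-Mistral-7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # fp16 roughly halves memory vs fp32
        device_map="auto",          # place layers on whatever devices are available
    )

    prompt = "Explain what the HellaSwag benchmark measures."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

On consumer hardware a quantized build (e.g. GGUF via llama.cpp) is usually more practical than fp16, but the overall workflow is the same.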