Comment by carbocation
2 years ago
I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all because Gemini Turbo isn't available in Bard yet.
I am not sure what Gemini Turbo is. Perhaps you meant Gemini Ultra? Because Gemini Pro (which is in this table) is currently accessible in Bard.
Yes, that's what I meant.
I get what you mean, but what would such a "qualitative evaluation" look like?
I think my ideal might be as simple as a few people who spend a lot of time with various models describing their experiences in separate blog posts.
I see.
I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have had an amazing experience with these two models:
- https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B (general usage)
- https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instr... (coding)
OpenChat 3.5 is also very good for general usage, but IMO NeuralHermes surpasses it significantly, so I switched a few days ago.
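For anyone curious what "running small LLMs locally" looks like in practice, here is a minimal sketch using the Hugging Face transformers library with the NeuralHermes model linked above. It assumes you have transformers, accelerate, and torch installed and enough memory for a 7B model in fp16; the plain-text prompt also skips the model's chat template for brevity, so real usage would format the prompt accordingly.

    # Minimal local-inference sketch using Hugging Face transformers.
    # Assumes enough GPU/CPU memory for a 7B model loaded in fp16.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mlabonne/NeuralHermes-2.5-Mistral-7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # fp16 roughly halves memory vs fp32
        device_map="auto",          # place layers on whatever devices are available
    )

    prompt = "Explain what the HellaSwag benchmark measures."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

On consumer hardware a quantized build (e.g. GGUF via llama.cpp) is usually more practical than fp16, but the overall workflow is the same.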