
Comment by Traubenfuchs

2 years ago

formatted nicely:

  Dataset        | Gemini Ultra | Gemini Pro | GPT-4
  MMLU           | 90           | 79         | 87
  BIG-Bench-Hard | 84           | 75         | 83
  HellaSwag      | 88           | 85         | 95
  Natural2Code   | 75           | 70         | 74
  WMT23          | 74           | 72         | 74

Excellent comparison; it seems that GPT-4 is winning in only one benchmark, namely HellaSwag (sentence completion).

Can't wait to get my hands on Bard Advanced with Gemini Ultra; I for one welcome this new AI overlord.

  • Horrible comparison, given that one score was achieved using 32-shot CoT (Gemini) and the other using 5-shot prompting (GPT-4).

    • CoT@32 isn't "32-shot CoT"; it's CoT with 32 samples (or rollouts) from the model, and the answer is taken by consensus vote from those rollouts. It doesn't use any extra data, only extra compute (a rough sketch of the voting procedure is at the end of this sub-thread). It's explained in the tech report here:

      > We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.

      (They could certainly have been clearer about it -- I don't see anywhere they explicitly explain the CoT@k notation, but I'm pretty sure this is what they're referring to given that they report CoT@8 and CoT@32 in various places, and use 8 and 32 as the example numbers in the quoted paragraph. I'm not entirely clear on whether CoT@32 uses the 5-shot examples or not, though; it might be 0-shot?)

      The 87% for GPT-4 is also with CoT@32, so it's more or less "fair" to compare it against Gemini's 90% with CoT@32. (Although, getting to choose the metric you report for both models is probably a little "unfair".)

      It's also fair to point out that with the more "standard" 5-shot eval Gemini does do significantly worse than GPT-4 at 83.7% (Gemini) vs 86.4% (GPT-4).

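      A minimal Python sketch of that consensus procedure, going off the quoted paragraph above (not Google's actual code; the helper functions, their names, and the fixed threshold here are assumptions for illustration -- the report tunes the threshold on a validation split):

        # Hypothetical sketch of the CoT@k consensus rule described in the
        # quoted paragraph. sample_cot_answer, greedy_answer, and threshold
        # are made-up stand-ins, not anything from the tech report.
        from collections import Counter

        def cot_at_k(question, sample_cot_answer, greedy_answer,
                     k=32, threshold=0.5):
            # Draw k independent chain-of-thought rollouts and keep only
            # the final answer extracted from each one.
            answers = [sample_cot_answer(question) for _ in range(k)]

            top_answer, votes = Counter(answers).most_common(1)[0]
            if votes / k >= threshold:
                # Consensus above the preset threshold: return the
                # majority-vote answer.
                return top_answer
            # No consensus: revert to a single greedy, non-CoT completion.
            return greedy_answer(question)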

I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

  • >Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

    It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.