
Comment by m3at

2 years ago

For others who were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro.

For the differences, looking at the technical report [1] on selected benchmarks, rounded scores in %:

  Dataset        | Gemini Ultra | Gemini Pro | GPT-4

  MMLU           | 90           | 79         | 87

  BIG-Bench-Hard | 84           | 75         | 83

  HellaSwag      | 88           | 85         | 95

  Natural2Code   | 75           | 70         | 74

  WMT23          | 74           | 72         | 74

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...

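As a quick sanity check of who leads where, here is a small sketch (not part of the thread) that tallies the rounded scores from the table above; a tie is reported as a tie rather than a win:

  # Tally the rounded benchmark scores quoted above and report the
  # leader per benchmark; ties are reported explicitly.
  scores = {
      "MMLU":           {"Gemini Ultra": 90, "Gemini Pro": 79, "GPT-4": 87},
      "BIG-Bench-Hard": {"Gemini Ultra": 84, "Gemini Pro": 75, "GPT-4": 83},
      "HellaSwag":      {"Gemini Ultra": 88, "Gemini Pro": 85, "GPT-4": 95},
      "Natural2Code":   {"Gemini Ultra": 75, "Gemini Pro": 70, "GPT-4": 74},
      "WMT23":          {"Gemini Ultra": 74, "Gemini Pro": 72, "GPT-4": 74},
  }

  for benchmark, by_model in scores.items():
      best = max(by_model.values())
      leaders = [m for m, s in by_model.items() if s == best]
      label = leaders[0] if len(leaders) == 1 else "tie: " + " / ".join(leaders)
      print(f"{benchmark:15} {label}")

  # Output:
  # MMLU            Gemini Ultra
  # BIG-Bench-Hard  Gemini Ultra
  # HellaSwag       GPT-4
  # Natural2Code    Gemini Ultra
  # WMT23           tie: Gemini Ultra / GPT-4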

  • Excellent comparison; it seems GPT-4 only wins on one benchmark, namely HellaSwag (sentence completion).

    Can't wait to get my hands on Bard Advanced with Gemini Ultra, I for one welcome this new AI overlord.

  • I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

    • >Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

      It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.


Thanks, I was looking for clarification on this. Using Bard now does not feel GPT-4 level yet, and this would explain why.

  • Not even original ChatGPT level; it is still a hallucinating mess. Did the free Bard get an update today? I am in one of the included countries, but it feels the same as it always has.

The numbers are not at all comparable, because Gemini uses CoT@32 and variable shot counts vs. 5-shot for GPT-4. This is very deceptive of them.

  • Yes and no. In the paper they do compare apples to apples with GPT-4 (they directly test GPT-4's CoT@32 but state its 5-shot score as "reported"). GPT-4 wins 5-shot and Gemini wins CoT@32. It also came off to me like they were implying something is off about GPT-4's MMLU.
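For anyone unclear on what is being compared in that last exchange: 5-shot means a single prompt containing five worked examples followed by one direct answer, while CoT@32 samples 32 chain-of-thought completions and aggregates them (the report's variant is "uncertainty-routed", falling back to a greedy answer when the samples don't agree strongly). Below is a minimal sketch of the two settings, assuming a hypothetical generate() call for whichever model is under test; it is an illustration, not the report's exact procedure.

  from collections import Counter

  def generate(prompt: str, temperature: float = 0.0) -> str:
      """Hypothetical completion call for the model under evaluation."""
      raise NotImplementedError

  def five_shot(question: str, examples: list[tuple[str, str]]) -> str:
      # One prompt containing five worked examples, then the test question;
      # the model answers once, directly.
      shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:5])
      return generate(f"{shots}\n\nQ: {question}\nA:")

  def cot_at_32(question: str, num_samples: int = 32) -> str:
      # Sample 32 chain-of-thought answers at nonzero temperature and take a
      # majority vote over the final-answer lines (a simplification of the
      # uncertainty-routed variant described in the report).
      finals = []
      for _ in range(num_samples):
          completion = generate(
              f"Q: {question}\nLet's think step by step.", temperature=0.7
          )
          finals.append(completion.strip().splitlines()[-1])
      return Counter(finals).most_common(1)[0][0]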