
Comment by m3at

2 years ago

For others who were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro.

For the differences, looking at the technical report [1] on selected benchmarks, rounded scores in %:

  Dataset        | Gemini Ultra | Gemini Pro | GPT-4

  MMLU           | 90           | 79         | 87

  BIG-Bench-Hard | 84           | 75         | 83

  HellaSwag      | 88           | 85         | 95

  Natural2Code   | 75           | 70         | 74

  WMT23          | 74           | 72         | 74

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...

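As a quick sanity check of who leads where, here is a small sketch (not part of the thread) that tallies the rounded scores from the table above; a tie is reported as a tie rather than a win:

  # Tally the rounded benchmark scores quoted above and report the
  # leader per benchmark; ties are reported explicitly.
  scores = {
      "MMLU":           {"Gemini Ultra": 90, "Gemini Pro": 79, "GPT-4": 87},
      "BIG-Bench-Hard": {"Gemini Ultra": 84, "Gemini Pro": 75, "GPT-4": 83},
      "HellaSwag":      {"Gemini Ultra": 88, "Gemini Pro": 85, "GPT-4": 95},
      "Natural2Code":   {"Gemini Ultra": 75, "Gemini Pro": 70, "GPT-4": 74},
      "WMT23":          {"Gemini Ultra": 74, "Gemini Pro": 72, "GPT-4": 74},
  }

  for benchmark, by_model in scores.items():
      best = max(by_model.values())
      leaders = [m for m, s in by_model.items() if s == best]
      label = leaders[0] if len(leaders) == 1 else "tie: " + " / ".join(leaders)
      print(f"{benchmark:15} {label}")

  # Output:
  # MMLU            Gemini Ultra
  # BIG-Bench-Hard  Gemini Ultra
  # HellaSwag       GPT-4
  # Natural2Code    Gemini Ultra
  # WMT23           tie: Gemini Ultra / GPT-4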

  • Excellent comparison; it seems GPT-4 only wins on one benchmark, namely HellaSwag (sentence completion).

    Can't wait to get my hands on Bard Advanced with Gemini Ultra, I for one welcome this new AI overlord.

  • I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

    • >Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

      It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.


Thanks, I was looking for clarification on this. Using Bard now does not feel GPT-4 level yet, and this would explain why.

  • Not even original ChatGPT level; it is still a hallucinating mess. Did the free Bard get an update today? I am in one of the included countries, but it feels the same as it always has.

The numbers are not at all comparable, because Gemini uses CoT@32 and variable shot counts vs. 5-shot for GPT-4. This is very deceptive of them.

  • Yes and no. In the paper they do compare apples to apples with GPT-4 (they directly test GPT-4's CoT@32 but state its 5-shot score as "reported"). GPT-4 wins 5-shot and Gemini wins CoT@32. It also came off to me like they were implying something is off about GPT-4's MMLU.
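For anyone unclear on what is being compared in that last exchange: 5-shot means a single prompt containing five worked examples followed by one direct answer, while CoT@32 samples 32 chain-of-thought completions and aggregates them (the report's variant is "uncertainty-routed", falling back to a greedy answer when the samples don't agree strongly). Below is a minimal sketch of the two settings, assuming a hypothetical generate() call for whichever model is under test; it is an illustration, not the report's exact procedure.

  from collections import Counter

  def generate(prompt: str, temperature: float = 0.0) -> str:
      """Hypothetical completion call for the model under evaluation."""
      raise NotImplementedError

  def five_shot(question: str, examples: list[tuple[str, str]]) -> str:
      # One prompt containing five worked examples, then the test question;
      # the model answers once, directly.
      shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:5])
      return generate(f"{shots}\n\nQ: {question}\nA:")

  def cot_at_32(question: str, num_samples: int = 32) -> str:
      # Sample 32 chain-of-thought answers at nonzero temperature and take a
      # majority vote over the final-answer lines (a simplification of the
      # uncertainty-routed variant described in the report).
      finals = []
      for _ in range(num_samples):
          completion = generate(
              f"Q: {question}\nLet's think step by step.", temperature=0.7
          )
          finals.append(completion.strip().splitlines()[-1])
      return Counter(finals).most_common(1)[0][0]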