Comment by m3at
2 years ago
For others that were confused by the Gemini versions: the main one being discussed is Gemini Ultra (which is claimed to beat GPT-4). The one available through Bard is Gemini Pro.
For the differences, here are a few selected benchmarks from the technical report [1], scores rounded to the nearest percent:
Dataset | Gemini Ultra | Gemini Pro | GPT-4
MMLU | 90 | 79 | 87
BIG-Bench-Hard | 84 | 75 | 83
HellaSwag | 88 | 85 | 95
Natural2Code | 75 | 70 | 74
WMT23 | 74 | 72 | 74
[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...
formatted nicely:

    Dataset        | Gemini Ultra | Gemini Pro | GPT-4
    ---------------+--------------+------------+------
    MMLU           |      90     |     79     |  87
    BIG-Bench-Hard |      84     |     75     |  83
    HellaSwag      |      88     |     85     |  95
    Natural2Code   |      75     |     70     |  74
    WMT23          |      74     |     72     |  74
Excellent comparison; it seems GPT-4 only wins on one benchmark, namely HellaSwag (sentence completion).
Can't wait to get my hands on Bard Advanced with Gemini Ultra. I, for one, welcome this new AI overlord.
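If you want to sanity-check that claim, here's a throwaway Python snippet, nothing official, just the rounded scores from the table upthread hard-coded (so don't over-read a 1-point gap or a tie):

    # Per-benchmark leaders, using the rounded scores (%) from the table upthread.
    scores = {
        "MMLU":           {"Gemini Ultra": 90, "Gemini Pro": 79, "GPT-4": 87},
        "BIG-Bench-Hard": {"Gemini Ultra": 84, "Gemini Pro": 75, "GPT-4": 83},
        "HellaSwag":      {"Gemini Ultra": 88, "Gemini Pro": 85, "GPT-4": 95},
        "Natural2Code":   {"Gemini Ultra": 75, "Gemini Pro": 70, "GPT-4": 74},
        "WMT23":          {"Gemini Ultra": 74, "Gemini Pro": 72, "GPT-4": 74},
    }
    for benchmark, by_model in scores.items():
        best = max(by_model.values())
        leaders = [model for model, score in by_model.items() if score == best]
        print(f"{benchmark}: {' / '.join(leaders)} ({best}%)")

On the rounded figures, GPT-4 leads only on HellaSwag and ties on WMT23; keep in mind, though, that the prompting setups behind these numbers differ, as pointed out below.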
Horrible comparison given one score was achieved using 32-shot CoT (Gemini) and the other was 5-shot (GPT-4).
I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all, because Gemini Ultra isn't available in Bard yet.
I get what you mean, but what would such "qualitative evaluation" look like?
Thanks, I was looking for clarification on this. Using Bard now does not feel GPT-4 level yet, and this would explain why.
Not even original ChatGPT level; it's still a hallucinating mess. Did the free Bard get an update today? I'm in the included countries, but it feels the same as it always has.
Permanent link to the result table contents: https://static.space/sha2-256:ea7e5d247afa8306cb84cbbd4438fd...
The numbers are not at all comparable, because Gemini uses CoT@32 and variable shot counts vs 5-shot for GPT-4. This is very deceptive of them.
Yes and no. In the paper, they do compare apples to apples with GPT-4 (they directly test GPT-4 at CoT@32, but state its 5-shot score as "reported"). GPT-4 wins at 5-shot and Gemini wins at CoT@32. It also came off to me like they were implying something is off about GPT-4's MMLU.
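If it helps make the 5-shot vs CoT@32 distinction concrete, here's a minimal sketch of the two setups. ask_model() is a hypothetical placeholder, the prompt templates are made up, and (if I'm reading the report right) Gemini's actual "uncertainty-routed" variant also falls back to a greedy answer when the 32 samples don't agree strongly enough:

    # Sketch only: contrasts plain 5-shot prompting with CoT@32 majority voting.
    # ask_model() is a hypothetical placeholder, not a real API.
    from collections import Counter

    def ask_model(prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError("plug in a real model call here")

    def five_shot(question: str, exemplars: list[tuple[str, str]]) -> str:
        # One greedy answer, conditioned on five solved examples.
        context = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars[:5])
        return ask_model(context + f"Q: {question}\nA:")

    def cot_at_32(question: str) -> str:
        # 32 sampled chain-of-thought answers, then a majority vote.
        # (Simplified: a real harness would extract the final answer from
        # each chain of thought before voting.)
        prompt = f"Q: {question}\nLet's think step by step.\nA:"
        samples = [ask_model(prompt, temperature=0.7) for _ in range(32)]
        return Counter(samples).most_common(1)[0][0]

Either way, the objection upthread stands: a CoT@32 number reflects a much larger inference budget than a single 5-shot answer, so the two aren't directly comparable.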