
Comment by Palmik

2 years ago

The table is *highly* misleading. It uses different methodologies all over the place.

For MMLU, it highlights the CoT@32 result, where Ultra beats GPT-4, but Ultra loses to GPT-4 with 5-shot, for example.

For GSM8K, it uses Maj1@32 for Ultra and 5-shot CoT for GPT-4, etc.

Then also, for some reason, it uses different metrics for Ultra and Pro, making them hard to compare.

What a mess of a "paper".

It really feels like the reason this is being released now, and not months ago, is that this is how long it took them to figure out the convoluted combination of evaluation procedures needed to beat GPT-4 on the various benchmarks.

Why is that misleading? It shows that Gemini with CoT is the best known combination of prompt and LLM on MMLU.

They simply compare the prompting strategies that work best with each model. Otherwise it would just be a comparison of how each model responds to one specific piece of prompt engineering.

  • > They simply compare the prompting strategies that work best with each model

    Incorrect.

    # Gemini marketing website, MMLU

    - Gemini Ultra 90.0% with CoT@32*

    - GPT-4 86.4% with 5-shot* (reported)

    # gemini_1_report.pdf, MMLU

    - Gemini Ultra 90.0% with CoT@32*

    - Gemini Ultra 83.7% with 5-shot

    - GPT-4 87.29% with CoT@32 (via API*)

    - GPT-4 86.4% with 5-shot (reported)

    The Gemini marketing website compared the best Gemini Ultra prompting strategy against a worse-performing (5-shot) GPT-4 prompting strategy.
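
To make the methodology gap concrete, here is a minimal sketch of the two procedures being mixed, under their usual definitions: "5-shot" scores one greedy completion conditioned on five worked examples, while "Maj1@32"-style chain-of-thought evaluation samples 32 reasoning traces and scores the majority-vote answer. (The report's CoT@32 is, as far as I can tell, an "uncertainty-routed" variant that falls back to greedy decoding when the samples don't reach a consensus threshold; the sketch keeps only the plain majority vote.) The `generate` and `extract_answer` callables are hypothetical placeholders, not either paper's actual harness.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-ins for a model call and an answer parser.
Generate = Callable[[str, float], str]   # (prompt, temperature) -> completion text
ExtractAnswer = Callable[[str], str]     # completion text -> final answer string


def five_shot_answer(generate: Generate, extract_answer: ExtractAnswer,
                     question: str, shots: List[str]) -> str:
    """One greedy completion conditioned on five in-context examples."""
    prompt = "\n\n".join(shots[:5] + [question])
    return extract_answer(generate(prompt, 0.0))


def maj1_at_32_answer(generate: Generate, extract_answer: ExtractAnswer,
                      question: str, cot_prefix: str) -> str:
    """Sample 32 chain-of-thought completions and return the majority-vote answer."""
    answers = [extract_answer(generate(cot_prefix + question, 0.7))
               for _ in range(32)]
    return Counter(answers).most_common(1)[0][0]
```

Neither procedure is wrong on its own; the problem is that a score produced by the second procedure for one model is not directly comparable to a score produced by the first for another.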

The places where they do use the same methodology seem to be within the error bars on the cherry-picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT-4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT-4 in most areas and worked backward from that goal to figure out how to get there.