
Comment by Palmik

2 years ago

The table is *highly* misleading. It uses different methodologies all over the place.

For MMLU, it highlights the CoT@32 result, where Ultra beats GPT-4, but Ultra loses to GPT-4 with 5-shot, for example.

For GSM8K, it uses Maj1@32 for Ultra and 5-shot CoT for GPT-4, etc.

Then also, for some reason, it uses different metrics for Ultra and Pro, making them hard to compare.

What a mess of a "paper".

It really feels like the reason this is being released now, and not months ago, is that this is how long it took them to figure out the convoluted combination of evaluation procedures needed to beat GPT-4 on the various benchmarks.

Why is that misleading? It shows that Gemini with CoT is the best known combination of prompt and LLM on MMLU.

They simply compare the prompting strategies that work best with each model. Otherwise it would just be a comparison of how each model responds to one specific piece of prompt engineering.

  • > They simply compare the prompting strategies that work best with each model

    Incorrect.

    # Gemini marketing website, MMLU

    - Gemini Ultra 90.0% with CoT@32*

    - GPT-4 86.4% with 5-shot* (reported)

    # gemini_1_report.pdf, MMLU

    - Gemini Ultra 90.0% with CoT@32*

    - Gemini Ultra 83.7% with 5-shot

    - GPT-4 87.29% with CoT@32 (via API*)

    - GPT-4 86.4% with 5-shot (reported)

    The Gemini marketing website compared the best Gemini Ultra prompting strategy against a worse-performing (5-shot) GPT-4 prompting strategy.
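
To make the methodology gap concrete, here is a minimal sketch of the two procedures being mixed, under their usual definitions: "5-shot" scores one greedy completion conditioned on five worked examples, while "Maj1@32"-style chain-of-thought evaluation samples 32 reasoning traces and scores the majority-vote answer. (The report's CoT@32 is, as far as I can tell, an "uncertainty-routed" variant that falls back to greedy decoding when the samples don't reach a consensus threshold; the sketch keeps only the plain majority vote.) The `generate` and `extract_answer` callables are hypothetical placeholders, not either paper's actual harness.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-ins for a model call and an answer parser.
Generate = Callable[[str, float], str]   # (prompt, temperature) -> completion text
ExtractAnswer = Callable[[str], str]     # completion text -> final answer string


def five_shot_answer(generate: Generate, extract_answer: ExtractAnswer,
                     question: str, shots: List[str]) -> str:
    """One greedy completion conditioned on five in-context examples."""
    prompt = "\n\n".join(shots[:5] + [question])
    return extract_answer(generate(prompt, 0.0))


def maj1_at_32_answer(generate: Generate, extract_answer: ExtractAnswer,
                      question: str, cot_prefix: str) -> str:
    """Sample 32 chain-of-thought completions and return the majority-vote answer."""
    answers = [extract_answer(generate(cot_prefix + question, 0.7))
               for _ in range(32)]
    return Counter(answers).most_common(1)[0][0]
```

Neither procedure is wrong on its own; the problem is that a score produced by the second procedure for one model is not directly comparable to a score produced by the first for another.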

The places where they do use the same methodology seem to be within the error bars on the cherry-picked benchmarks they selected. Maybe for some tasks it's roughly comparable to GPT-4 (still a major accomplishment for Google to come close to closing the gap for the current generation of models), but this looks like someone had the goal of showing Gemini beating GPT-4 in most areas and worked backward from that goal to figure out how to get there.