
Comment by daveguy

2 years ago

Table 2 indicates Pro is generally closer to 4 than 3.5 and Ultra is on par with 4.

If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open-source models that claim 4-level evals but struggle to actually perform challenging work as soon as you start testing them.

Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath.

Ehhh, not really: it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster considering I'm using GPT-4 probably 100 or more times a day, and this would be a huge downgrade.

  • Pro is approximately in the middle between GPT-3.5 and GPT-4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), closer to 3.5 on two (MATH, HellaSwag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle (the bucketing is sketched in code after this thread).

    So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.

    • > at least if you assume the benchmarks to be of equal significance.

      That is an excellent point. Pro's performance will definitely depend on the use case, given the variability between 3.5 and 4 across benchmarks. It will be interesting to see user reviews on different tasks. But the two-quarter lead time for Ultra means it may as well not have been announced. A lot can happen in 3-6 months.
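
To make the "closer to 3.5 / middle / closer to 4" counting above concrete, here is a minimal sketch of that comparison. The scores and the `classify` helper below are hypothetical placeholders for illustration, not figures from the Gemini report's Table 2; the rule assumed here is simply whether Pro's score falls within a small band around the midpoint of the 3.5-to-4 gap.

```python
# Minimal sketch of the "closer to 3.5 / middle / closer to 4" bucketing discussed above.
# All scores below are hypothetical placeholders, NOT numbers from the Gemini report.

def classify(gpt35: float, gpt4: float, pro: float, band: float = 0.15) -> str:
    """Bucket Pro's score relative to the GPT-3.5 -> GPT-4 gap.

    position = 0.0 means Pro matches GPT-3.5, 1.0 means it matches GPT-4.
    Scores within `band` of the midpoint (0.5) count as "middle".
    """
    position = (pro - gpt35) / (gpt4 - gpt35)
    if abs(position - 0.5) <= band:
        return "middle"
    return "closer to 4" if position > 0.5 else "closer to 3.5"

# Hypothetical placeholder scores: (GPT-3.5, GPT-4, Gemini Pro)
placeholder_scores = {
    "benchmark_a": (60.0, 90.0, 74.0),  # lands in the middle band
    "benchmark_b": (60.0, 90.0, 66.0),  # closer to 3.5
    "benchmark_c": (60.0, 90.0, 87.0),  # closer to 4
}

for name, (g35, g4, pro) in placeholder_scores.items():
    print(f"{name}: {classify(g35, g4, pro)}")
```

The width of the middle band is a judgment call that mirrors the "approximately in the middle" wording in the comment; widening or narrowing it shifts benchmarks between buckets, which is exactly why the equal-significance caveat matters.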