Comment by KaoruAoiShiho
2 years ago
Pro benchmarks are here: https://storage.googleapis.com/deepmind-media/gemini/gemini_...
Sadly it's 3.5 quality, :(
Lol that's why it's hidden in a PDF.
They basically announced GPT 3.5, then. Big whoop; by the time Ultra is out, GPT-5 is probably also out.
Isn't having GPT 3.5 still a pretty big deal? Obviously they are behind but does anyone else offer that?
3.5 is still highly capable and Google investing a lot into making it multi modal combined with potential integration with their other products makes it quite valuable. Not everyone likes having to switch to ChatGPT for queries.
Yeah, right now the leaderboard is pretty much: GPT4 > GPT 3.5 > Claude > Llama2. If Google just released something (Gemini Pro) on par with GPT 3.5 and will release something (Gemini Ultra) on par with GPT 4 in Q1 of next year while actively working on Gemini V2, they are very much back in the game.
Obviously they are behind but does anyone else offer that?
Claude by Anthropic is out, offers more, and is being actively used.
I thought there were some open-source models in the 70-120B range that were GPT3.5 quality?
Yup, it's all a performance for the investors
+1. The investors are the customers of this release, not end users.
Table 2 indicates Pro is generally closer to 4 than 3.5 and Ultra is on par with 4.
If you think eval numbers mean a model is close to 4, then you clearly haven't been scarred by the legions of open-source models that claim 4-level evals but struggle to actually perform challenging work as soon as you start testing them.
Perhaps Gemini is different and Google has tapped into their own OpenAI-like secret sauce, but I'm not holding my breath
Ehhh not really, it even loses to 3.5 on 2/8 tests. For me it feels pretty lackluster: I use GPT-4 probably 100 or more times a day, and this would be a huge downgrade.
Pro is approximately in the middle between GPT 3.5 and GPT 4 on four measures (MMLU, BIG-Bench-Hard, Natural2Code, DROP), it is closer to 3.5 on two (MATH, HellaSwag), and closer to 4 on the remaining two (GSM8K, HumanEval). Two one way, two the other way, and four in the middle.
So it's a split almost right down the middle, if anything closer to 4, at least if you assume the benchmarks to be of equal significance.
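If you want to check the split yourself, here's a rough sketch of the bucketing I mean. The scores below are placeholders, not the real Table 2 numbers; plug in the actual values from the PDF:

    # Where does Pro sit between GPT-3.5 and GPT-4 on each benchmark?
    # Placeholder scores only -- substitute the real values from Table 2.
    scores = {
        # benchmark: (gpt35, gemini_pro, gpt4)
        "MMLU":           (70.0, 79.0, 86.0),
        "BIG-Bench-Hard": (66.0, 75.0, 83.0),
        # ... remaining six benchmarks ...
    }

    for name, (gpt35, pro, gpt4) in scores.items():
        # 0.0 = at GPT-3.5, 1.0 = at GPT-4
        position = (pro - gpt35) / (gpt4 - gpt35)
        bucket = ("closer to 3.5" if position < 1/3
                  else "middle" if position < 2/3
                  else "closer to 4")
        print(f"{name}: {position:.2f} ({bucket})")

That's all this comes down to: normalize Pro's score between the 3.5 and 4 scores on each benchmark and see which third it lands in.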