Comment by caesil

2 years ago

If you think eval numbers mean a model is close to GPT-4, then you clearly haven't been scarred by the legions of open-source models that claim GPT-4-level evals but struggle to perform challenging work as soon as you start testing them.

Perhaps Gemini is different and Google has tapped into its own OpenAI-like secret sauce, but I'm not holding my breath.