
Comment by siva7

3 hours ago

I have a few secret prompts for testing the complex reasoning capabilities of new models (in law and medicine). On my own benchmark, Gemini (2.5 Pro) is behind Anthropic (Sonnet 4.5 with basic thinking) and OpenAI (the pro model) by a wide margin, and I trust my own benchmark more than public leaderboards. So it's the other way around: Google is trying to catch up to where the others already are. It just doesn't look that way to some people because Google undercuts on price, and most people don't have their own complex problems with verified solutions to test against (so they can't see how badly Gemini actually does).

This thread is about Gemini 3, though. It will be interesting to see your benchmark results once it's available.