Comment by rmi_
14 hours ago
Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, ahead of Pro.
I'm not saying it's bad, but it's definitely different from the others.
The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point don't publish a benchmark; that also explains why their results are useless.
-
Edit, since I'm not able to reply to the comment below:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Why not? I described this in more detail in other comments.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chatbots, article writing, tool usage, calling external APIs, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
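To make that concrete, here is a minimal sketch of what I mean, assuming the OpenAI Python SDK and its JSON mode (the model name, key name, and question are just placeholders, not anything from the benchmark): even with structured output enabled, the prompt still has to pin down how values inside the fields should look, and that is exactly where a failure like the quoted "*1*, *1*" would show up.

```python
# Sketch: structured output (JSON mode) plus an in-field formatting requirement.
# Assumes the OpenAI Python SDK; "gpt-4o-mini" and the "answers" key are placeholders.
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # structured output is enabled
    messages=[
        {
            "role": "system",
            "content": (
                "Reply as a JSON object with a single key 'answers'. "
                "Its value must be a comma-separated list of plain digits, "
                "e.g. '1, 1' -- no markdown, no asterisks, no extra text."
            ),
        },
        {
            "role": "user",
            "content": "How many capital cities does France have? And Germany?",
        },
    ],
)

data = json.loads(resp.choices[0].message.content)
# A model that answers '*1*, *1*' still returns valid JSON,
# but fails the formatting requirement inside the field.
print(data["answers"])
```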