Maybe I am missing something obvious on the website, but where is the documentation? Where do you explain what each number means, or at least give a short overview of what the models are being tested on?
You can hover over some elements, click on a model to get more info like the tested categories, and hover over the correct-test numbers to see what it got wrong.
I just started on this, so I'm currently adding more tests and improving the UI. Let me know if you have any suggestions.
The ranking is currently mostly about which model is the "smartest", i.e. most likely to respond correctly to any given question or request, regardless of the domain.
In your benchmark, GPT 5 Nano is basically tied with Opus?
Yes. Opus could do a lot better, but it fails often because it doesn't respect the given formatting instructions/output format.
I could modify the tests to emphasize the requirements, but then what's the point of the test? In real life, we expect the AI to do what we ask, especially for agentic use cases or in n8n, because if the output is even slightly wrong, the entire workflow fails.
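To illustrate why slightly-wrong output breaks a workflow: a minimal sketch of strict output parsing, the kind an agentic pipeline node might do. `parse_model_output` is a hypothetical helper, not from the benchmark; it fails fast instead of guessing when the model wraps its JSON in markdown fences or adds commentary.

```python
import json

def parse_model_output(raw: str) -> dict:
    """Strictly parse a model reply as a JSON object; raise on any deviation."""
    data = json.loads(raw)  # raises on fences, prose, trailing commentary
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data

# A compliant reply parses fine:
parse_model_output('{"answer": 42}')

# A reply wrapped in a markdown code fence -- a common formatting
# violation -- fails the strict parse and would abort the workflow:
try:
    parse_model_output('```json\n{"answer": 42}\n```')
    failed_strict_parse = False
except ValueError:
    failed_strict_parse = True
```

The point is that downstream nodes never see malformed data; the failure surfaces immediately at the parsing step rather than corrupting the rest of the run.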
Interesting. This has to do with the "instruction following" aspect, right? I saw that GPT models score a lot higher than Claude on those benchmarks.
I haven't done my own tests, but I did notice a lot of models score very low there. You give them specific instructions and they ignore them, pattern-matching to whatever format they saw most commonly during training.
Also, they're not really tied: Opus has a much better consistency and reasoning score (meaning the reasoning made sense and only the final output was wrong).