Comment by siva7
1 month ago
This is probably the greatest one-time AI "Benchmark" ever made. The foundation companies have been gaming traditional benchmarks for years, so no one can really map those numbers onto real-world experience. The car wash test, on the other hand, tells me what kind of intelligence I can expect.
I also don't trust the maxbenched results.
I am thus making my own benchmarks: https://aibenchy.com
Maybe I am missing something obvious on the website, but where is the documentation? Where do you explain what each number means, or at least give a short overview of what the models are being tested on?
You can hover over some stuff, click on a model to get more info like tested categories, and hover over the correct test numbers to see some info about what they got wrong.
I just started on this, so currently adding more tests and I keep improving the UI. Let me know if you have any suggestions.
The ranking currently is mostly about the "smartest" model: the one most likely to respond correctly to any given question or request, regardless of the domain.
In your benchmark, GPT 5 Nano is basically tied with Opus?
Yes. Opus could do a lot better, but fails a lot because it doesn't respect the given formatting instructions/output format.
I could modify the tests to emphasize the requirements, but then what's the point of a test? In real life, we expect the AI to do something if we ask it, especially for agentic use cases or in n8n, because if the output is slightly wrong, the entire workflow fails.
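To make the strict-formatting point concrete, here is a minimal sketch of the difference between a strict and a lenient grader. This is not the actual aibenchy.com scoring code; the function names `strict_pass` and `lenient_pass`, the `{"answer": ...}` output format, and the sample outputs are all hypothetical, chosen only for illustration.

```python
import json
import re


def strict_pass(output: str, expected: str) -> bool:
    """Pass only if the whole output is a JSON object of the form {"answer": expected}.

    Hypothetical format: a downstream workflow step would call json.loads on
    the raw output, so any extra text around the JSON makes it fail.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(parsed, dict)
        and set(parsed) == {"answer"}
        and parsed["answer"] == expected
    )


def lenient_pass(output: str, expected: str) -> bool:
    """Pass if the expected answer appears anywhere in the output as a whole word."""
    pattern = rf"\b{re.escape(expected)}\b"
    return re.search(pattern, output, re.IGNORECASE) is not None


# Two made-up model outputs carrying the same answer.
compliant = '{"answer": "drive"}'
chatty = 'Sure! Here is the result:\n{"answer": "drive"}'

for name, output in [("compliant", compliant), ("chatty", chatty)]:
    print(name, strict_pass(output, "drive"), lenient_pass(output, "drive"))
```

Under these assumptions, the chatty output has the right answer but only passes the lenient check, which is roughly the situation where a model reasons correctly and still breaks an automated workflow.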
Also, they're not really tied: Opus has much better consistency and reasoning scores (which means the reasoning made sense and only the final output was wrong).
For me it's interesting because no normal person I know would ever inject "because it's better for the environment" into anything so small-scale. So not only does it show they suck, it shows how easy it is to inject side-ideology into simple exchanges.
You don’t know enough people, then. There are a lot of environmentally conscious people who would absolutely first think “because it is close we should walk” and then follow up with the logical conclusion that you can’t walk to wash your car. Many people communicate by sharing their thinking process, and I can think of many who would share their ideology as it pertains to a question like this. A pragmatic environmentalist (hopefully that is all of them) would know that their ideology isn’t consequential but could certainly mention it. After all, you may need to drive your car to the car wash to wash it, but do you need to wash it? Are the chemicals used by the car wash harmful? Are there better ways to keep a car maintained?
Referring to "the normal people you know" is purely anecdotal evidence and can't be used to infer anything at all about "side-ideology". Perhaps you only know people that don't care about the environment?
The majority of people I know care about the environment, but my point is that they would never inject a phrase like that into a quick exchange about going to wash the car 50m away. In wanting to be pure of heart, you missed the actual point.