Comment by segmondy
3 days ago
garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools would stay the same so we could really compare the models.
there are plenty of other benchmarks that disagree with these. with that said, from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fares.
Hey. I like your roast of benchmarks.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by humans against rubrics). Would love for you to check them out and share your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
Which benchmarks are not garbage?
I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.