Comment by paradite
3 days ago
Hey. I like your roast of benchmarks.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by humans against rubrics). Would love for you to check them out and share your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results: