Comment by smokel
10 hours ago
I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
It ceases to be a useful benchmark of general ability when you post it publicly for them to train against.