Comment by smokel

10 hours ago

I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.


blibble 8 hours ago

It ceases to be a useful benchmark of general ability once you post it publicly for them to train against.