
Comment by keepamovin

5 hours ago

Funny, I just made https://model-tracker.com because model performance changes all the time, and it would be good to have a subjective signal of what people are actually feeling today. Also, benchmarks are flaky af, as this paper shows.

The idea is knowing what to try first today saves a bit of time.

I would love to see a stable test over time with a held-out set of easy/medium/hard challenges. I, like many others, have noticed a large drop in recent performance w/ Claude Opus (and Sonnet), and more sites like these would hold the labs more accountable for sneaky backend changes that nerf/degrade performance.
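For what it's worth, here is a rough sketch of what that kind of stable test could look like: re-run the same fixed tasks on a schedule, compute a pass rate per difficulty tier, and flag any tier that falls well below a baseline day. Everything in it is made up for illustration: the `RUNS` records, the dates, the function names, and the 0.2 threshold are all assumptions, not how model-tracker.com or any lab actually measures this.

```python
from collections import defaultdict

# Hypothetical run records: (date, tier, passed). In practice these would come
# from re-running the same held-out tasks against the model on a schedule.
RUNS = [
    ("2025-01-01", "easy", True), ("2025-01-01", "easy", True),
    ("2025-01-01", "medium", True), ("2025-01-01", "medium", True),
    ("2025-01-01", "hard", True), ("2025-01-01", "hard", False),
    ("2025-01-02", "easy", True), ("2025-01-02", "easy", True),
    ("2025-01-02", "medium", True), ("2025-01-02", "medium", False),
    ("2025-01-02", "hard", False), ("2025-01-02", "hard", False),
]


def pass_rates(runs):
    """Return {date: {tier: pass_rate}} from (date, tier, passed) records."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for date, tier, passed in runs:
        tally = counts[date][tier]
        tally[0] += int(passed)  # number of passes
        tally[1] += 1            # number of attempts
    return {
        date: {tier: p / n for tier, (p, n) in tiers.items()}
        for date, tiers in counts.items()
    }


def flag_drops(rates, baseline_date, threshold=0.2):
    """List (date, tier, drop) where the pass rate fell more than
    `threshold` below the baseline day's rate for the same tier."""
    base = rates[baseline_date]
    flagged = []
    for date in sorted(rates):
        if date == baseline_date:
            continue
        for tier, rate in rates[date].items():
            if tier not in base:
                continue  # no baseline to compare against
            drop = base[tier] - rate
            if drop > threshold:
                flagged.append((date, tier, round(drop, 2)))
    return flagged


if __name__ == "__main__":
    rates = pass_rates(RUNS)
    for date, tier, drop in flag_drops(rates, "2025-01-01"):
        print(f"{date}: {tier} pass rate dropped by {drop}")
```

With only a couple of tasks per tier, a single flaky run swings the rate a lot, so a real version would need many more tasks (and probably repeated runs per task) before a flagged drop means anything.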

I'm working on something similar to evaluate model performance over time using tasks based on your own code. obviously this is still susceptible to the same hacking mechanics documented here, but at a local level it's easier to detect/fix, and it should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks.

also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that up. I feel like half of the nerfing complaints are people getting past the honeymoon phase...