Comment by bisonbear
3 hours ago
working on something similar to evaluate model performance over time using tasks based on your own code. obviously this is still susceptible to the same hacking mechanics documented here, but at a local level, it's easier to detect/fix, and should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks
also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that. I feel like half of the nerfing complaints are people getting past honeymoon phase...
No comments yet
Contribute on Hacker News ↗