Comment by sowbug
1 day ago
Why don't they ask their premier model to generate a bench for them?
It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.
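To make the scoring rule concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `interview` and the judge callables are hypothetical stand-ins for real model calls, and the "old"/"new" verdict labels are made up. A trial counts only if every judge picks the newer model.

```python
import random
from typing import Callable, Sequence

# Hypothetical types: none of these correspond to a real API.
Interview = Callable[[int], str]    # seed -> interview transcript
Judge = Callable[[str, int], str]   # (transcript, seed) -> "old" or "new"


def agreement_score(interview: Interview,
                    judges: Sequence[Judge],
                    trials: int = 100) -> float:
    """Percentage of trials in which every judge says the newer model won."""
    wins = 0
    for seed in range(trials):
        # A different seed per trial, as the comment suggests.
        transcript = interview(seed)
        if all(judge(transcript, seed) == "new" for judge in judges):
            wins += 1
    return 100.0 * wins / trials


def make_fake_judge(p_new: float, salt: int) -> Judge:
    """Stub judge that votes "new" with probability p_new (placeholder only)."""
    def judge(transcript: str, seed: int) -> str:
        rng = random.Random(seed * 1000 + salt)
        return "new" if rng.random() < p_new else "old"
    return judge


if __name__ == "__main__":
    # Stand-ins for "older model" and "newer model" acting as judges.
    judges = [make_fake_judge(0.8, salt=1), make_fake_judge(0.9, salt=2)]
    score = agreement_score(lambda seed: f"transcript-{seed}", judges)
    print(f"score: {score:.1f}%")
```

For the referee variant, pass a single referee judge in `judges` instead of the two participants.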