Comment by aubanel
1 year ago
He's speaking about his objective of building ever-stronger LLMs: for that, his secondary objective is to measure their real performance.
Human preference is not that good a proxy measurement: for instance, it can be gamed by making the model more assertive, which sharply reduces humans' ability to spot its errors [0].
So what he's really saying is that non-rigorous human vibe checks (like those LMSys Chatbot Arena is built on, though I love it) won't cut it anymore for evaluating models, because models are now past that point. Just as you can't tell how smart a smart person really is from a 2-minute casual conversation.
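To make that gaming mechanism concrete, here's a toy Monte Carlo sketch (all probabilities are invented for illustration, not taken from [0]): an assertive model is genuinely wrong more often, but because judges catch errors in confident answers less often, more of its answers survive the glance test.

    import random

    random.seed(0)

    # Invented toy parameters, not measurements from [0]:
    P_CORRECT = {"honest": 0.80, "assertive": 0.70}  # true accuracy
    P_CATCH   = {"honest": 0.60, "assertive": 0.25}  # judge spots an error

    def judged_ok(model):
        """True if the judge walks away thinking the answer was fine."""
        if random.random() < P_CORRECT[model]:
            return True  # genuinely correct
        # Wrong answer: confident delivery suppresses error-spotting.
        return random.random() >= P_CATCH[model]

    n = 100_000
    ok = {m: sum(judged_ok(m) for _ in range(n)) / n
          for m in ("honest", "assertive")}
    print(ok)
    # Analytically: assertive answers look fine 0.70 + 0.30*0.75 = 92.5%
    # of the time vs 0.80 + 0.20*0.40 = 88% for honest ones, so the
    # less accurate model wins the preference comparison.

The numbers are arbitrary; the point is only that preference tracks "looks fine," not "is correct."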
It's trivial to come up with prompts that 4o fails. If it's hard to come up with prompts that o1 succeeds on but 4o fails, that implies the delta is not that great.
Or: the delta depends on the nature of the problem/prompt, we haven't yet figured out which kinds those are, the range of prompts with a large delta is relatively narrow, and so finding those examples is a work in progress?
i.e. when you can't beat them, make new metrics
And you can absolutely evaluate how smart someone is in a 2-minute casual conversation. You won't be able to tell how well-versed they are in some niche topic, but %insert something about different flavors of intelligence and how they don't equate to subject-matter expertise%
It’s a common pattern that AI benchmarks get too easy, so they make new ones that are harder.
As models improve, human preference will become a worse proxy measurement (e.g. as model capabilities surpass the human's ability to judge correctness at a glance). This can be due to more raw capability, or to more persuasion/charisma.
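A minimal sketch of that degradation (the accuracies and judge skill below are made-up numbers): score a genuinely stronger model Arena-style against a weaker one, varying only how often the judge spots an error at a glance. As judge skill drops, most comparisons become ties and the measured preference score collapses toward 50%, even though the real capability gap is unchanged.

    import random

    random.seed(0)

    def looks_ok(accuracy, judge_skill):
        # An answer passes the glance test if it's correct, or if it's
        # wrong but the judge fails to notice (toy assumption).
        if random.random() < accuracy:
            return True
        return random.random() >= judge_skill

    def preference_score(acc_strong, acc_weak, judge_skill, n=100_000):
        # Arena-style score for the stronger model: win=1, tie=0.5, loss=0.
        score = 0.0
        for _ in range(n):
            a = looks_ok(acc_strong, judge_skill)
            b = looks_ok(acc_weak, judge_skill)
            score += 0.5 if a == b else (1.0 if a else 0.0)
        return score / n

    # The stronger model is better (95% vs 85% accurate) in every row;
    # only the judge's error-spotting ability changes.
    for skill in (0.9, 0.5, 0.1):
        print(f"judge_skill={skill}: stronger model scores "
              f"{preference_score(0.95, 0.85, skill):.3f}")
    # Prints roughly 0.545, 0.525, 0.505: the signal fades toward a
    # coin flip as the judge stops being able to tell at a glance.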