Comment by verve_rat
8 hours ago
My theory is we will end up in a similar spot to hiring people. You can look at a CV (benchmarks) but you won't know for sure until you've worked with them for six months.
We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?
Yes, the entire field of software engineering ran aground on not being able to test how well people can write software.
But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.
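The repeated-trials-plus-ablation idea above can be sketched in a few lines. Everything here is hypothetical: `run_trial` is a stand-in for a real eval harness, and the scores are simulated, but the shape of the comparison (run many trials per condition, compare medians) is the point.

```python
import random
from statistics import median

def run_trial(has_tool: bool, rng: random.Random) -> float:
    # Stub for an actual eval run; a real harness would execute the
    # model on a task suite and return a score in [0, 1]. Here we
    # simulate a noisy base score plus a bonus when the tool is present.
    return rng.gauss(0.60, 0.05) + (0.10 if has_tool else 0.0)

def ablation(n_trials: int = 30, seed: int = 0) -> dict:
    # Run the same eval repeatedly with and without the feature under
    # test, then compare medians rather than single runs.
    rng = random.Random(seed)
    with_tool = [run_trial(True, rng) for _ in range(n_trials)]
    without_tool = [run_trial(False, rng) for _ in range(n_trials)]
    return {
        "median_with": median(with_tool),
        "median_without": median(without_tool),
        "delta": median(with_tool) - median(without_tool),
    }

result = ablation()
```

With enough trials, the median difference isolates the feature's effect from run-to-run noise, which is exactly what you can't do with a handful of human candidates.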
The CV-to-six-months analogy is exactly right, and it's also why benchmarks for hiring people stopped being useful. The signal that holds up is what you see when something breaks, and that is hard to compress into a number.
This smells so much like an AI-generated comment.
Not many things are broken in as many ways as hiring is these days. I hope we do not end up there.
Terrible comparison. A CV is just a list; it tells you barely anything about performance, and that's assuming the candidate isn't lying to get through the HR filter.
And we can judge developer performance, it just takes six months to a year of working with a team, so it's hard to turn into a metric.
You don't run 1000 interview rounds on problems you're actually solving. If you did, hiring would be fine, minus the social-fit aspect, which isn't as relevant for a model.