
Comment by forrestthewoods

6 hours ago

At the end of the day “feel” is what people rely on to pick which tool they use.

Is “feel” unscientific and broken? Sure, maybe, why not.

But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.

Benchmarks are a sometimes useful tool. But we are in prime Goodhart’s Law territory.

2 comments


AstroBen 6 hours ago

yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities

forrestthewoods 6 hours ago

I don’t think this is even remotely true in practice.
I honestly have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.
The idea that all models have very close performance across all domains is a moderately insane take.
At any given moment the best model for my actual projects and my actual work varies.
Quite honestly, Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 was released, no one was particularly excited. It was better, with some slightly larger numbers, but whatever. It took about a month before everyone realized “holy shit, this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE-bench didn’t mean a damn thing.