Comment by aspenmartin

5 days ago

If benchmarks didn’t exist, we would have to invent them, because “vibes” is a ridiculous idea: oh, I know, I’ll be super unscientific and horrendously biased, and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm, all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If labs were just stupidly benchmaxxing, they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of alleged benchmaxxing with Llama 4? The entire org was gutted and the CEO spent billions hiring better people. Every lab takes evaluations extremely seriously.

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

  • > Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

    This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascent in performance on, say, Humanity's Last Exam, to take one benchmark as an example? You could trivially get those numbers close to 100% if you wanted to.

    • I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of the limiting case you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

      To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

    • > This is...just incredibly conspiratorial and a bit silly.

      Do you think so? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they didn't at least explore those avenues.

Vibes is just UX. There are whole careers, teams, and even industries dedicated to it, and yeah, it isn't easy because you need aggregate data from people.

  • Um, kind of, but not really: it’s a mix of UX and actual measurements of what tasks it can do. Also, UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s, again, just benchmarking, and it’s done carefully and with best practices.