Comment by aspenmartin

5 days ago

Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important, but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data that contradicts that claim.

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

  • Again, correct, but it overstates the issue. I can say labs don’t want this. This happened, arguably unintentionally, in Meta’s Llama 4 release: it went horribly, heads rolled, something like several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

    Evals come from a million places, and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together, the aggregate signal is highly useful, as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements: they build benchmarks themselves to probe their models, and they also have flywheel traffic and A/B tests.

    You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.

    • Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. Doing the most basic of things at least as well as it used to is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.

  • That’s your call, just don’t expect anyone ever to take that seriously. It’s not like we don’t have exact evaluations like this.

    • I would encourage you to look into the open evals of some of these benchmarks (find one that actually is open-data, this is itself a good challenge), read the results generated and assess them for yourself.

      This is what my coworkers and I (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.

      The people with real incentives and skin in the game are telling you that the data diverges from "the data".

      I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.

      But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.

"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

  • You can call it a cult but it’s several thousand skilled workers who know what they’re doing, by and large, most of whom have a PhD and know how science and statistics work. Benchmarks are incredibly hard, and any PR or comms department at any company is going to obviously want to make things as rosy as possible, but beneath this are earnest, expensive efforts to get good quality measurements. The better you can do this the better you can compete. If you want to make a modeling decision you run an ablation, and the quality of that decision is only as good as your measurements.

    • The cult in this case is TESCREAL, not everyone working on AI. Last I checked not all the "several thousand skilled workers" in AI subscribe to TESCREAL ideology, although it has been a while since I've been to the Bay. Maybe things have changed since my time at Berkeley, and Dario's belief that he will eventually be made immortal by mind uploading is more widespread.

      Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.

There is no data that I would trust that contradicts it.

Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

Claude was heavily lobotomised for my work starting sometime in February.

I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays, so whatever...

  • That’s ok but at what point is this getting into conspiracy territory? You have just said there is nothing you would believe to the contrary, but then by definition that’s not exactly a very thoughtful or insightful position.

    • I never said that I am not willing to believe the contrary.

      I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.

      If people I genuinely trust tell me about their experiences, I am willing to try again.

      But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.

      Also, "scientific" doesn't mean what it used to mean. When the n is small, or it's just anecdotes (I am aware of the irony) blown out of proportion, I really can't take the data and conclusions seriously.

Seems like a bunch of noise. What does this even mean?

It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

  • No, it’s: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done in careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.

    - evaluations need to be done at the same time to avoid drift in your bias

    - you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?

    - which one did you do first? Raters have a tendency to bias in one direction or another

    - you also know the label! You know which model is which! This biases your assessment…

    And on and on and on. Careful science exists for a reason.
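
A minimal sketch of what handling two of those points (presentation order and knowing the label) could look like, assuming you already have paired outputs from two models on the same prompts. The function names `blinded_pairs`, `unblind`, and `sign_test_p` are made up for illustration and are not from any particular eval harness. The sign test is just one simple choice of significance check.

```python
import random
from math import comb


def blinded_pairs(prompts, outputs_a, outputs_b, seed=0):
    """Hide model identity and randomize left/right order and trial order."""
    rng = random.Random(seed)
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        # Coin flip decides which model is shown on the left.
        if rng.random() < 0.5:
            trials.append({"prompt": prompt, "left": a, "right": b, "left_is": "A"})
        else:
            trials.append({"prompt": prompt, "left": b, "right": a, "left_is": "B"})
    rng.shuffle(trials)  # avoid rating all prompts in a fixed order
    return trials


def unblind(trials, choices):
    """Map rater choices ('left', 'right', 'tie') back to model wins."""
    wins = {"A": 0, "B": 0}
    for trial, choice in zip(trials, choices):
        if choice == "tie":
            continue  # ties carry no information for the sign test
        left = trial["left_is"]
        right = "B" if left == "A" else "A"
        wins[left if choice == "left" else right] += 1
    return wins


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test: p-value under 'no real preference'."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The rater should only ever see `prompt`, `left`, and `right`; the `left_is` key stays hidden until `unblind`. This does nothing about drift over time or whether the prompts are representative of your work, which still have to be handled in how the test set is chosen and when the ratings are collected.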