Comment by codeflo

1 year ago

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

VW got in a lot of trouble for this

  • Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.

    • There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.

      It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.

      It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.

      Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.

    • Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.

  • True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.

    It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.

    • This is the apples to apples version. Perhaps might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds then picks the best answer, otherwise it just zero-shots the prompt in everyday use.

      I say this because the be test still uses the same hardware (model) but changed the way it behaved by running emissions friendly parameters ( a different execution framework) that wouldn’t have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.

      What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.

  • Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.

    These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.

  • Actually performing well on a task that is used as a benchmark is not comparable to decieving authorities about how much toxic gas you are releasing.

  • Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.

    Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.

    Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.

  • GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?

    • I think the OP's point is that chat GPT-3.5 may have a chess-engine baked-in to its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess-engine.

      1 reply →

  • Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).

    • > VW got in trouble for running _different_ software in test vs prod.

      Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.

      The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.

      2 replies →

  • It’s approximately bad, like most of ML

    On one side:

    Would you expect a model trained on no Spanish data to do well on Spanish?

    On the other:

    Is it okay to train on the MMLU test set?