Comment by codeflo

1 year ago

At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.

55 comments

codeflo

darkerside 1 year ago

VW got in a lot of trouble for this

sigmoid10 1 year ago
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
- rsynnott 1 year ago
  
  There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.
  It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.
  It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.
  Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.
- 0xFF0123 1 year ago
  
  The only difference is the legality. From an integrity point of view it's basically the same
  
  33 replies →
- waffletower 1 year ago
  
  Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.
ArnoVW 1 year ago
True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.
It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.
- K0balt 1 year ago
  
  This is the apples to apples version. Perhaps might be more accurate to say that when detecting a benchmark attempt the model tries the prompt 3 times with different seeds then picks the best answer, otherwise it just zero-shots the prompt in everyday use.
  I say this because the be test still uses the same hardware (model) but changed the way it behaved by running emissions friendly parameters ( a different execution framework) that wouldn’t have been used in everyday driving, where fuel efficiency and performance optimized parameters were used instead.
  What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.
bluGill 1 year ago

Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
Swenrekcah 1 year ago

Actually performing well on a task that is used as a benchmark is not comparable to decieving authorities about how much toxic gas you are releasing.
close04 1 year ago

Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
conradev 1 year ago
GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?
- GolfPopper 1 year ago
  
  I think the OP's point is that chat GPT-3.5 may have a chess-engine baked-in to its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess-engine.
  
  1 reply →
TrueDuality 1 year ago
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
- krisoft 1 year ago
  
  > VW got in trouble for running _different_ software in test vs prod.
  Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.
  The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
  
  2 replies →
gdiamos 1 year ago

It’s approximately bad, like most of ML
On one side:
Would you expect a model trained on no Spanish data to do well on Spanish?
On the other:
Is it okay to train on the MMLU test set?
tightbookkeeper 1 year ago

This is 10 year old story. It’s very interesting which ones stay in the public consciousness.
newerman 1 year ago

Funny response; you're not wrong.

dang 1 year ago

We detached this subthread from https://news.ycombinator.com/item?id=42144784.

(Nothing wrong with it! It's just a bit more generic than the original topic.)