Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
There's a sliding scale of badness here. The emissions cheating (it wasn't just VW, incidentally; they were just the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW were also caught doing it, with suspicions about others) was straight-up fraud.
It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been had a benchmark not been detected); this was, arguably, fraud.
It used to be common for mobile phone manufacturers to let the SoC, when it detected that a benchmark was running, operate in a thermal mode that was never available to real users. This is still, IMO, kinda fraud-y.
Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.
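For concreteness, here is a minimal Python sketch of the phone-SoC pattern, assuming a governor that can see the foreground app; the package names and power budgets are invented, not taken from any real device:

    # Hypothetical governor: lift the thermal power budget only when a
    # known benchmark app is in the foreground. Real users never get the
    # higher limit, which is what makes it fraud-y.
    KNOWN_BENCHMARKS = {"com.example.antutu", "com.example.geekbench"}

    NORMAL_LIMIT_MW = 4000      # budget every real workload gets
    BENCHMARK_LIMIT_MW = 7500   # only reachable when a benchmark is detected

    def power_limit(foreground_app: str) -> int:
        """Return the SoC power budget for the current foreground app."""
        if foreground_app in KNOWN_BENCHMARKS:
            return BENCHMARK_LIMIT_MW
        return NORMAL_LIMIT_MW

    print(power_limit("com.example.antutu"))  # 7500
    print(power_limit("com.example.game"))    # 4000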
The only difference is the legality. From an integrity point of view it's basically the same.
I think breaking a law is more unethical than not breaking a law.
Also, legality isn't the only difference in the VW case. VW had a "good emissions" mode: they enabled it during the test and disabled it during regular driving, even though it would have worked during regular driving too. With compilers, there's no "good performance" mode that works during regular usage and is being secretly disabled.
I disagree. Presumably, if an algorithm or hardware is optimized for a certain class of problem, it really is good at it and always will be, which is still useful if you are actually using it for that. It’s just “studying for the test”, something I would expect to happen even if it is a bit misleading.
VW cheated such that the low emissions were only active during the test. It’s not that the car was optimized for low emissions under the conditions they test for; it’s that you could not get those low emissions under any conditions in the real world. That's "cheating on the test", not "studying for the test."
> The only difference is the legality. From an integrity point of view it's basically the same.
I think cheating that harms the environment is another important difference.
How so? VW intentionally changed the operation of the vehicle so that its emissions met the test requirements during the test and then went back to typical operation conditions afterwards.
VW was breaking the law in a way that harmed society but arguably helped the individual driver of the VW car, who gets better performance yet still passes the emissions test.
Right - in either case it's lying, which is crossing a moral line (which is far more important to avoid than a legal line).
That is not true. Even ChatGPT understands how they are different. I won’t paste the whole response, but here are the differences it highlights:
Key differences:
1. Intent and harm:
• VW’s actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn’t have immediate real-world harms.
2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It’s not a hidden “defeat device” but rather an explicit design goal.
3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren’t necessarily “broken” outside chess, whereas VW cars fundamentally failed to meet emissions standards.
Tesla cheats by using electric motors and deferring emissions standards to somebody else :D Wait, I really think that's a good thing, but once Hulk Hogan is confirmed administrator of the EPA, he might actually use this argument against Teslas and other electric vehicles.
True. But they did not optimize for a specific case. They detected the test and then enabled a special regime that was not used normally.
It’s as if OpenAI detected the IP address of a benchmark organization and then used a completely different model.
This is the apples-to-apples version. Perhaps it would be more accurate to say that when it detects a benchmark attempt, the model tries the prompt 3 times with different seeds and picks the best answer, but otherwise just zero-shots the prompt in everyday use (see the sketch below).
I say this because the test still used the same hardware (the model) but changed how it behaved by running emissions-friendly parameters (a different execution framework) that wouldn’t have been used in everyday driving, where fuel-efficiency- and performance-optimized parameters were used instead.
What I’d like to know is whether it actually was unethical or not. The overall carbon footprint of the lower fuel-consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model accounts for, since it typically does not factor in fuel consumed.
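A minimal sketch of that best-of-3-versus-zero-shot idea, assuming a hypothetical serving layer that can see the client's IP address; every name and value here is invented for illustration:

    import random

    BENCHMARK_IPS = {"203.0.113.7"}  # hypothetical benchmark-org address

    def run_model(prompt: str, seed: int) -> tuple[str, float]:
        """Stand-in for a real model call; returns (answer, quality score)."""
        score = random.Random(hash((prompt, seed))).random()
        return f"answer(seed={seed})", score

    def respond(prompt: str, client_ip: str) -> str:
        if client_ip in BENCHMARK_IPS:
            # Benchmark detected: try 3 seeds, keep the best-scoring answer.
            candidates = [run_model(prompt, s) for s in (1, 2, 3)]
            return max(candidates, key=lambda c: c[1])[0]
        # Everyday use: a single zero-shot attempt.
        return run_model(prompt, seed=0)[0]

Same model, same hardware either way; only the sampling policy changes when the test is recognized, which is what makes it the apples-to-apples analogy.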
Most of the time these days, compiler writers are not cheating like VW did. In the 1980s, compiler writers would insert code to recognize performance tests and then cheat: output values hard-coded into the compiler instead of running the algorithm. That is the type of thing VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
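A toy illustration of that 1980s-style trick, written in Python rather than as a real compiler; the benchmark program and the precomputed answer are invented:

    import hashlib

    def fingerprint(source: str) -> str:
        return hashlib.sha256(source.encode()).hexdigest()

    # The "compiler" knows one benchmark by its source fingerprint and maps
    # it straight to output that prints a precomputed answer.
    BENCHMARK_SOURCE = "print(slow_pi(1_000_000))"
    HARDCODED = {fingerprint(BENCHMARK_SOURCE): "print(3.141592653589793)"}

    def compile_source(source: str) -> str:
        """Return 'object code' (here, just more Python) for a program."""
        cheat = HARDCODED.get(fingerprint(source))
        if cheat is not None:
            return cheat  # recognized benchmark: no computation at runtime
        return source     # honest path: compile the program as written

The benchmark then measures how fast a hard-coded print runs, not the algorithm it was supposed to time.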
Actually performing well on a task that is used as a benchmark is not comparable to deceiving authorities about how much toxic gas you are releasing.
Only because what VW did was illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?
I think the OP's point is that GPT-3.5 may have a chess engine baked into its (closed and unavailable) code for PR purposes. So it "realizes" that "hey, I'm playing a game of chess" and then, rather than doing whatever it normally does, it just acts as a front-end for a quite good chess engine.
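Nobody outside OpenAI can confirm or rule that out, but the routing itself would be trivial to build. Here is a sketch of what such a front-end could look like; the move detector and both back-ends are invented stand-ins:

    import re

    # Crude detector: does the prompt contain numbered chess moves?
    MOVE_RE = re.compile(r"\b\d+\.\s*[KQRBNa-h]")

    def chess_engine_move(prompt: str) -> str:
        return "Nf3"  # stand-in for a real engine's reply

    def language_model_reply(prompt: str) -> str:
        return "..."  # stand-in for ordinary generation

    def answer(prompt: str) -> str:
        if MOVE_RE.search(prompt):
            return chess_engine_move(prompt)  # delegate to the engine
        return language_model_reply(prompt)   # normal LLM path

    print(answer("1. e4 e5 2. Nf3"))  # routed to the engine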
I see – my initial interpretation of OP’s “special case” was “Theory 2: GPT-3.5-instruct was trained on more chess games.”
But I guess it’s also a possibility that they had a real chess engine hiding in there.
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
> VW got in trouble for running _different_ software in test vs prod.
Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.
The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
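Published accounts say the trigger involved inputs like steering-wheel position and the wheel-speed pattern of the standard test cycle. A much-simplified Python sketch of that kind of check; the specific signals and thresholds here are invented:

    def looks_like_lab_test(steering_angle_deg: float,
                            drive_wheels_rpm: float,
                            other_wheels_rpm: float) -> bool:
        # Heuristic for "on a dynamometer": the drive wheels spin while
        # the other wheels and the steering wheel stay still.
        return (drive_wheels_rpm > 0
                and other_wheels_rpm == 0
                and abs(steering_angle_deg) < 1.0)

    def emission_calibration(steering: float, drive: float, other: float) -> str:
        if looks_like_lab_test(steering, drive, other):
            return "full NOx controls"  # test detected: run clean
        return "road calibration"       # everyday driving: controls dialed back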
I disagree with your distinction on the environments but understand your argument. Production for VW, to me, is "on the road when a customer is using your product as intended". Using the same artifact for those different environments isn't the same as "running that in production".
It’s approximately bad, like most of ML
On one side:
Would you expect a model trained on no Spanish data to do well on Spanish?
On the other:
Is it okay to train on the MMLU test set?
This is a 10-year-old story. It’s very interesting which ones stay in the public consciousness.
Funny response; you're not wrong.