Comment by tlb
1 year ago
All these claims are like "programming is impossible because I typed in a program and it had a bug". Yes, everyone's first attempt at a reward function is hackable. So you have to tighten up the reward function to exclude solutions you don't want.
What claim would that be? It’s a hilarious, factual example of unintended consequences in model training. Of course they fixed the freaking bug in about two seconds.
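For what it's worth, tlb's point about tightening the reward function can be sketched in a few lines. This is a toy, hypothetical example (a 1-D gridworld with a goal square, names invented here): a naive proximity-based reward is "hackable" because hovering near the goal forever pays more than reaching it, while a tightened reward that pays only on success, with a small per-step cost, excludes that loophole.

```python
def naive_reward(pos, goal):
    """Hackable shaping reward: pays for mere proximity on every step."""
    return 1.0 / (1.0 + abs(goal - pos))

def tightened_reward(pos, goal, done):
    """Tightened reward: pays only on actual success, small cost per step."""
    return 1.0 if (done and pos == goal) else -0.01

goal = 5
hover = [4] * 100           # a "hacking" policy: sit next to the goal forever
reach = [1, 2, 3, 4, 5]     # an honest policy: walk to the goal and stop

naive_hover = sum(naive_reward(p, goal) for p in hover)
naive_reach = sum(naive_reward(p, goal) for p in reach)

tight_hover = sum(tightened_reward(p, goal, done=False) for p in hover)
tight_reach = sum(tightened_reward(p, goal, done=(p == goal)) for p in reach)

# Under the naive reward, hovering beats reaching (~50.0 vs ~2.28);
# under the tightened reward, only reaching the goal pays off.
```

This is of course the "first attempt is hackable" dynamic in miniature: the fix isn't that RL is impossible, it's that the reward spec gets iterated until the unwanted solutions stop paying.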