Comment by tlb

1 year ago

All these claims are like "programming is impossible because I typed in a program and it had a bug". Yes, everyone's first attempt at a reward function is hackable. So you have to tighten up the reward function to exclude solutions you don't want.
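
A minimal sketch of what "tightening up" can look like, in Python. Everything here (the toy task, the `Step` fields, the 0.01 penalty coefficient) is invented for illustration, not taken from the anecdote under discussion: a naive shaping reward that pays for raw movement can be farmed by oscillating in place, while scoring net progress and charging for energy makes that exploit unprofitable.

```python
from dataclasses import dataclass

@dataclass
class Step:
    # Hypothetical per-step observation; all fields invented for this sketch.
    distance_travelled: float     # absolute distance moved this step
    prev_distance_to_goal: float  # distance to goal before the step
    distance_to_goal: float       # distance to goal after the step
    energy_used: float            # actuation cost this step

def naive_reward(s: Step) -> float:
    # Pays for raw movement, so an agent can farm it by oscillating
    # in place, collecting reward without ever reaching the goal.
    return s.distance_travelled

def tightened_reward(s: Step) -> float:
    # Pays only for net progress toward the goal and charges for
    # energy use, so the oscillation exploit is strictly unprofitable.
    progress = s.prev_distance_to_goal - s.distance_to_goal
    return progress - 0.01 * s.energy_used

# One oscillating step: moves 1.0 unit but makes no net progress.
hack = Step(distance_travelled=1.0, prev_distance_to_goal=5.0,
            distance_to_goal=5.0, energy_used=1.0)
print(naive_reward(hack))      # 1.0   -> the hack pays off
print(tightened_reward(hack))  # -0.01 -> the hack now loses
```

The energy term exists only to make the degenerate behavior strictly worse than finishing the task; each patch like this excludes one class of unwanted solutions, which is the iteration described above.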

What claim would that be? It’s a hilarious, factual example of unintended consequences in model training. Of course they fixed the freaking bug in about two seconds.