Comment by SkiFire13
6 days ago
This completely misses the point of reinforcement learning. The reward signal needs to be representative of what you actually want (e.g. in chess that would be winning).
Using an LLM as a judge means you will ultimately optimize for stories that the LLM likes, not necessarily for stories that people like. For this to work, the judge LLM needs to be as close to a human as possible, but building such a model is what you were trying to do in the first place!
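To make the contrast concrete, here is a minimal sketch of the two reward setups (the `judge_llm.score` API and all names are illustrative, not from any particular library):

    def chess_reward(game_result: str) -> float:
        """Ground-truth reward: directly encodes the outcome we care about."""
        return 1.0 if game_result == "win" else 0.0

    def llm_judge_reward(story: str, judge_llm) -> float:
        """Proxy reward: whatever the judge model happens to prefer."""
        # judge_llm.score is a hypothetical method returning a score in [0, 1].
        return judge_llm.score(story)

    # Optimizing chess_reward optimizes winning. Optimizing llm_judge_reward
    # optimizes the judge's preferences, which only track human taste if the
    # judge already models humans well -- which was the original problem.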