Comment by SamBam

6 days ago

Indeed. Also the whole thing is just "apply an Evolutionary Algorithm to stories." The only interesting question is whether the LLM that decides the story's fitness rating (which they call Elo, despite seeming to have nothing to do with the actual Elo ranking system) can mimic a human's rating. Given the brief example, it's not clear that it can, since it seems no better.

To clarify, we use an Elo rating system to update the scores, so if you lose to a higher-rated story you don't lose as much Elo. Definitely agree with the LLM judge criticism, though; it's still an open question how we can make them better. Using the repeated story-comparison judging system does help make them more consistent, and a good rubric helps make them more human-like as well. The really big question is how large the generator-verifier gap is between creating stories and marking them.
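To make the "you don't lose as much" point concrete, here is a minimal sketch of the standard Elo update. It is illustrative only and not necessarily the code used for the stories: the function names, the K-factor of 32, and the example ratings are all assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is an assumed K-factor; 32 is a common default, not a known project setting.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses, and vice versa.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical ratings: a 1500-rated story loses to a stronger and a weaker one.
    lost_to_stronger, _ = elo_update(1500, 1700, 0.0)
    lost_to_weaker, _ = elo_update(1500, 1300, 0.0)
    print("points lost vs 1700:", 1500 - lost_to_stronger)
    print("points lost vs 1300:", 1500 - lost_to_weaker)
```

Losing to the 1700-rated story costs fewer points than losing to the 1300-rated one, because the expected score against the stronger opponent was already low.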