Comment by tamassimond
6 days ago
To clarify, we use an Elo rating system to update models' scores, so if you lose to a higher-rated story you don't lose as much Elo. I definitely agree with the LLM-judge criticism, though; it's still an open question how we can make them better. Using the repeated story-comparison judging system does help make them more consistent, and a good rubric helps make them more human-like as well. The really big question is how large the generator-verifier gap is between creating stories and marking them.
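For anyone unfamiliar with why losing to a stronger story costs less, here is a minimal Python sketch of the standard Elo update. It assumes the usual logistic expected-score formula with a 400-point scale and a hypothetical K-factor of 32. The comment doesn't say which constants the leaderboard actually uses, so treat those as placeholders.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's story wins, 0.0 if it loses, 0.5 for a tie.
    k = 32 is an assumed K-factor, not necessarily the one used in practice.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)  # points moved from B to A (negative if A lost)
    return rating_a + delta, rating_b - delta


# A 1500-rated model loses once to a stronger story and once to a weaker one.
loss_vs_stronger = 1500 - update_elo(1500, 1700, 0.0)[0]
loss_vs_weaker = 1500 - update_elo(1500, 1300, 0.0)[0]
print(round(loss_vs_stronger, 1))  # 7.7
print(round(loss_vs_weaker, 1))    # 24.3
```

The loss shrinks when the opponent was expected to win, which is exactly the "you don't lose as much" behaviour described above.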