Comment by jmalicki
8 hours ago
The most advanced training data is in the form of rubrics as rewards.
A human asks a question, then writes rubrics to judge the LLMs response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
No comments yet
Contribute on Hacker News ↗