Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious at to what the training data consists of.
The most advanced training data is in the form of rubrics as rewards.
A human asks a question, then writes rubrics to judge the LLMs response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious at to what the training data consists of.
The most advanced training data is in the form of rubrics as rewards.
A human asks a question, then writes rubrics to judge the LLMs response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
https://arxiv.org/abs/2507.17746