Comment by visarga
3 months ago
There is a huge point here: those prompts have answers, followed by more prompts and answers. Looking at an AI answer in hindsight, you can often tell from the follow-up messages whether it was a good or bad response. So you can derive a preference score from that signal, train a preference model on it, and then run RLHF on the base model. You also get separation (privacy protection) this way.
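A minimal sketch of the first step the comment describes: scoring each assistant answer from the user's next message. The keyword lists and scoring rules below are illustrative assumptions, not anything from the comment; in practice you would use a learned classifier rather than keywords, and the resulting (answer, score) pairs would feed the preference model.

```python
# Hypothetical follow-up signals; a real pipeline would learn these, not hard-code them.
POSITIVE_SIGNALS = ["thanks", "that worked", "perfect", "exactly what i needed"]
NEGATIVE_SIGNALS = ["that's wrong", "doesn't work", "not what i asked", "try again"]

def score_followup(followup: str) -> int:
    """Return +1, -1, or 0 based on how the user reacted in the next message."""
    text = followup.lower()
    if any(s in text for s in NEGATIVE_SIGNALS):
        return -1
    if any(s in text for s in POSITIVE_SIGNALS):
        return 1
    return 0

def label_conversation(turns):
    """turns: list of (role, text) tuples in order.
    Returns (answer, score) for each assistant turn followed by a user turn."""
    labels = []
    for i in range(len(turns) - 1):
        role, text = turns[i]
        next_role, next_text = turns[i + 1]
        if role == "assistant" and next_role == "user":
            labels.append((text, score_followup(next_text)))
    return labels
```

The scores never leave the logging side: only the derived preference model (or its gradients) touches training, which is the separation the comment points to.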