So… the LLM learns from a corpus it has created?
Yes. The learning comes from running tests on the program and checking that they pass, so it's running as an agent. Tests and the compiler give hard feedback; that's the data outside the model that it learns from.
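A minimal sketch of that feedback loop in Python. Running the suite via pytest is one plausible harness; the file names and timeout are illustrative assumptions, not any particular system's setup:

    import os
    import subprocess
    import tempfile

    def test_reward(candidate_source: str, test_source: str) -> float:
        # Hard feedback from outside the model: actually run the tests.
        # Reward is 1.0 if the whole suite passes, 0.0 otherwise.
        with tempfile.TemporaryDirectory() as d:
            with open(os.path.join(d, "solution.py"), "w") as f:
                f.write(candidate_source)
            with open(os.path.join(d, "test_solution.py"), "w") as f:
                f.write(test_source)
            result = subprocess.run(
                ["python", "-m", "pytest", "test_solution.py", "-q"],
                cwd=d, capture_output=True, timeout=30)
            return 1.0 if result.returncode == 0 else 0.0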
I think modern RLHF schemes use a separate reward model (itself a trained model) to train the LLM, so LLMs teaching each other isn't new (toy sketch below).
My knowledge is limited, just based on a read of https://huyenchip.com/2023/05/02/rlhf.html though.
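As a toy illustration of one model training another, here is a best-of-k selection step; sample and score are hypothetical stand-ins for the LLM and the learned reward model, and the winning candidate would then feed back into training:

    from typing import Callable

    def best_of_k(sample: Callable[[str], str],
                  score: Callable[[str, str], float],
                  prompt: str, k: int = 8) -> str:
        # The reward model (score) ranks the LLM's (sample) candidates;
        # the winner can then be used as a fine-tuning target.
        candidates = [sample(prompt) for _ in range(k)]
        return max(candidates, key=lambda c: score(prompt, c))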
RLHF
It’s basically “reinforcement learning”, and it’s a common technique in machine learning.
You provide a goal with a big reward (e.g. the tests passing) and smaller rewards for any particular behaviours you want to encourage, then leave the machine to figure out the best way to earn those rewards through trial and error (rough sketch below).
After a few million attempts, you generally have either a decent result or more data about how the reward weights need adjusting before iterating on the training.
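A sketch of that reward structure and search loop, assuming the goal is a passing test suite; the weights (10.0, 0.5, 0.01) are illustrative, not tuned values, and generate/evaluate are hypothetical hooks for the candidate generator and the reward:

    from typing import Callable, Tuple

    def shaped_reward(tests_pass: bool, lint_clean: bool, n_lines: int) -> float:
        # Big reward for the goal itself, small ones for encouraged behaviours.
        r = 10.0 if tests_pass else 0.0   # the goal: tests pass
        r += 0.5 if lint_clean else 0.0   # shaping: lint-clean code
        r -= 0.01 * n_lines               # shaping: prefer shorter programs
        return r

    def trial_and_error(generate: Callable[[], str],
                        evaluate: Callable[[str], float],
                        attempts: int = 1_000_000) -> Tuple[str, float]:
        # Crude trial and error: sample candidates, keep the best seen so far.
        best, best_r = "", float("-inf")
        for _ in range(attempts):
            cand = generate()
            r = evaluate(cand)
            if r > best_r:
                best, best_r = cand, r
        return best, best_r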
How do you define the goal? This kind of de novo neural program synthesis is a very hard problem.