Comment by nrhrjrjrjtntbt
1 day ago
Yes. The learning comes from running tests on the program and making sure they pass, so the model is running as an agent. Tests and the compiler give hard feedback; that's the data from outside the model that it learns from.
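To make the "hard feedback" idea concrete, here is a rough Python sketch of turning a test run into a reward signal. This is only my illustration, not how any particular lab does it: the function name `reward_from_tests`, the 0/1 reward, and the timeout are all assumptions, and a real setup would sandbox the untrusted code instead of just running it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def reward_from_tests(candidate_source: str, test_source: str,
                      timeout_s: float = 10.0) -> float:
    """Hypothetical reward: 1.0 if the candidate passes its tests, else 0.0.

    The candidate and the tests are written to one script and run in a
    fresh interpreter. A syntax error, a failed assert, or a timeout all
    count as failure, which stands in for compiler plus test feedback.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        script.write_text(candidate_source + "\n\n" + test_source)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    # Exit code 0 means the script ran to completion with no failed assert.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\n"
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    print(reward_from_tests(good, tests))
    print(reward_from_tests(bad, tests))
```

The point is that the score comes from the interpreter and the tests, not from the model's own opinion of its output, so it can't be talked into a passing grade.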
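I think modern RLHF schemes already use one model to train another: a reward model, fit to human preference rankings, scores the LLM's outputs during fine-tuning. So LLMs teaching each other isn't new.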
My knowledge is limited, just based on a read of https://huyenchip.com/2023/05/02/rlhf.html though.