Comment by nrhrjrjrjtntbt
18 hours ago
Yes. The learning comes from running tests on the program and making sure they pass, i.e. running the model as an agent. Tests and the compiler give hard feedback. That's the data from outside the model that it learns from.
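To make the "hard feedback" idea concrete, here is a minimal sketch of turning compiler and test results into a single reward number. It is only an illustration under my own assumptions: the function name `reward_from_tests`, the `(args, expected)` test-case format, and the toy `add` candidates are all made up for this example and don't come from any particular training setup.

```python
from typing import Any


def reward_from_tests(source: str, func_name: str,
                      cases: list[tuple[tuple, Any]]) -> float:
    """Hypothetical reward: fraction of test cases the candidate passes.

    Returns 0.0 if the candidate fails to compile or load.
    """
    try:
        # "Compiler" feedback: reject code that does not even parse.
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    namespace: dict[str, Any] = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real setup would sandbox this.
        exec(code, namespace)
    except Exception:
        return 0.0

    func = namespace.get(func_name)
    if not callable(func) or not cases:
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # A crashing test case simply counts as a failure.
    return passed / len(cases)


# Toy candidates standing in for model-generated programs (assumed examples).
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"

for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(name, reward_from_tests(src, "add", cases))
```

Here the correct candidate scores 1.0, the buggy one passes only one of the three cases, and the one with a syntax error scores 0.0. The point is that the score comes from executing the program, not from the model's own opinion of it.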
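I think modern RLHF schemes already use one model to train another: a reward model is trained on human preference rankings, and its scores are then used as the reward signal when fine-tuning the LLM. So LLMs teaching each other isn't new.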
My knowledge is limited, though. It's based only on reading https://huyenchip.com/2023/05/02/rlhf.html.