Comment by nrhrjrjrjtntbt

1 day ago

Yes. The learning comes from running tests on the program and making sure they pass, i.e. running the model as an agent. Tests and the compiler give hard feedback; that's the data from outside the model that it learns from.
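To make "hard feedback" concrete, here is a rough sketch of how test results could be turned into a reward signal. The function name `reward_from_tests`, the scoring rule (fraction of cases passed), and the toy `add` candidates are all my own assumptions for illustration, not how any particular lab does it. A real setup would also sandbox the generated code instead of calling `exec` on it directly.

```python
from typing import Dict, List, Tuple


def reward_from_tests(source: str, func_name: str,
                      cases: List[Tuple[tuple, object]]) -> float:
    """Hypothetical reward: fraction of test cases the candidate passes.

    Returns 0.0 if the candidate fails to compile or to define func_name.
    """
    try:
        # "Compiler" feedback: a syntax error means no reward at all.
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # NOTE: unsafe outside a sandbox
    except Exception:
        return 0.0

    func = namespace.get(func_name)
    if not callable(func) or not cases:
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(cases)


# Toy test suite and three candidate "model outputs" (all made up).
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(name, reward_from_tests(src, "add", cases))
```

The point is only that the number comes from actually running the code, not from the model's own opinion of it, so it can be used as a training signal that the model cannot talk its way around.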

I think modern RLHF schemes already use one model (a reward model) to train the LLM, so LLMs teaching each other isn't new.
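As far as I understand it, the model doing the teaching in RLHF is a reward model trained on pairs of responses where a human preferred one over the other. A minimal sketch of the pairwise loss usually given for that step, `-log(sigmoid(s_chosen - s_rejected))`, is below. The function name and the example scores are made up for illustration, and real implementations work on batches of tensors rather than two floats.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred response scores higher, large otherwise.
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.0, 2.0))
```

Once trained, the reward model's scores stand in for human judgment while the LLM is being optimized, which is the sense in which one model trains another.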

My knowledge is limited, just based on a read of https://huyenchip.com/2023/05/02/rlhf.html though.