Comment by nrhrjrjrjtntbt

3 months ago

Yes. The learning comes from running tests on the program and checking that they pass, so the model is effectively running as an agent. Tests and the compiler give hard feedback: that's the data from outside the model that it learns from.
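
A rough sketch of what "hard feedback" could look like in practice. This is only an illustration: `reward_from_tests`, the `add` function, and the test cases are all made-up names, not taken from any real training setup. The idea is that the score comes from the compiler and the tests, not from the model's own opinion of its output.

```python
def reward_from_tests(source: str, func_name: str, cases) -> float:
    """Score a candidate program: fraction of test cases passed.

    Hypothetical helper for illustration. Returns 0.0 if the code
    does not compile, fails on load, or does not define func_name.
    """
    try:
        # "Compiler" feedback: a syntax error is a hard failure.
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    namespace = {}
    try:
        exec(code, namespace)  # load the candidate's definitions
    except Exception:
        return 0.0

    func = namespace.get(func_name)
    if not callable(func):
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply counts as a failure
    return passed / len(cases) if cases else 0.0


# Made-up test cases for a hypothetical add(a, b) task.
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"  # missing colon, will not compile

for name, src in [("good", good), ("buggy", buggy), ("broken", broken)]:
    print(name, reward_from_tests(src, "add", CASES))
```

In an actual training loop a number like this would be fed back as the reward for the sampled program. Running untrusted generated code with `exec` like this is only OK for a toy; a real setup would need sandboxing.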

I think modern RLHF schemes already have models (reward models) that train LLMs. LLMs teaching each other isn't new.