Comment by mynti
5 days ago
Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?
Rumours say you do something like:
- scrape a large pool of real repos and sort the good ones from the bad
- have a model propose plausible bugs and build scenarios around them
- use the project's own tests as a verifiable reward for fixing them
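In code, each of those thousands of environments could be as small as the sketch below. This is a hedged illustration, not anyone's actual pipeline: `repo_url` and `bug_patch` are placeholders, and the only real machinery is git plus the repo's test suite as the reward.

```python
import subprocess
import tempfile

def make_bug_fixing_env(repo_url: str, bug_patch: str) -> str:
    """Clone a known-good repo and apply a model-proposed bug patch,
    yielding a broken checkout the agent must repair."""
    workdir = tempfile.mkdtemp(prefix="rl-env-")
    subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
    # "git apply -" reads the patch from stdin
    subprocess.run(["git", "apply", "-"], input=bug_patch.encode(),
                   cwd=workdir, check=True)
    return workdir

def reward(workdir: str) -> float:
    """Verifiable reward: 1.0 if the test suite passes after the
    agent's edits, 0.0 otherwise."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=workdir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```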
The real fun begins when you consider that with every new generation of models + harnesses they get better at this, where "better" can mean better at sorting good/bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.
So the next version after that is even better, because it got more data / better data. And it keeps compounding...
This is mainly why we're seeing so many improvements, so fast: month to month now, from every ~3 months half a year ago, from every ~6 months a year ago. It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that isn't, things can still improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.
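For the non-verifiable side, a "council of LLMs" in practice often just means majority voting across several judge models. A toy sketch, assuming each judge is simply a function from (task, answer) to a verdict string (real deployments would wrap model API calls):

```python
from collections import Counter
from typing import Callable

# Each judge maps (task, answer) -> a verdict like "good" or "bad".
Judge = Callable[[str, str], str]

def council_verdict(task: str, answer: str, judges: list[Judge]) -> str:
    """Majority vote across several judge models, reducing the impact
    of any single model's idiosyncratic preferences."""
    votes = Counter(judge(task, answer) for judge in judges)
    return votes.most_common(1)[0][0]
```

Note that voting only dilutes biases the judges don't share; a bias common to all of them survives the vote, which is exactly the objection raised below.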
Judgement-based problems are still tough: LLM-as-a-judge might just bake the earlier models' biases in even deeper. Imagine if ChatGPT judged photos: anything yellow would win.
Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.
This is actually a very valid technique. We do the same (as an RL environments provider).
Except we bundle it with a custom browser renderer, which generates rewards based on DOM diffs rather than screenshots.
the browser renderer is opensource https://github.com/wootzapp/wootz-browser
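For illustration only (this is not the actual wootz-browser reward code), DOM-diff scoring could look like the sketch below. The point of the design choice: you can assert the exact structural change you expected, instead of comparing pixels that shift with fonts and rendering.

```python
from difflib import unified_diff

def dom_diff_reward(dom_before: str, dom_after: str, expected_change: str) -> float:
    """Reward 1.0 if the agent's action produced the expected structural
    change in the serialized DOM, 0.0 otherwise. A screenshot diff would
    be noisy and couldn't name *which* element changed."""
    diff = "\n".join(unified_diff(dom_before.splitlines(),
                                  dom_after.splitlines(), lineterm=""))
    return 1.0 if expected_change in diff else 0.0
```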
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
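As a concrete shape for that loop, here's a hedged sketch of wrapping an arbitrary CLI. `agent.next_command` and `score` are hypothetical stand-ins for a policy and an automatic grader, not any real library's API:

```python
import subprocess

def run_episode(agent, task: str, score: callable, max_steps: int = 10) -> float:
    """Generic RL loop over a CLI: the agent proposes shell commands,
    observes their output, and the episode is scored automatically."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = agent.next_command(transcript)  # hypothetical agent API
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return score(transcript)  # automatic reward, e.g. tests or output checks
```

Everything in that loop is cheap and mechanical except writing `score`.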
> and the quality of the result can be measured automatically
this part is nontrivial though