Comment by mynti
5 days ago
Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?
Rumours say you do something like:
- scrape a large pool of real repos and sort the good ones from the bad
- have a model propose plausible bugs and build scenarios around them
- use the project's own tests as a verifiable reward for fixing them
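In code, each of those thousands of environments could be as small as the sketch below. This is a hedged illustration, not anyone's actual pipeline: `repo_url` and `bug_patch` are placeholders, and the only real machinery is git plus the repo's test suite as the reward.

```python
import subprocess
import tempfile

def make_bug_fixing_env(repo_url: str, bug_patch: str) -> str:
    """Clone a known-good repo and apply a model-proposed bug patch,
    yielding a broken checkout the agent must repair."""
    workdir = tempfile.mkdtemp(prefix="rl-env-")
    subprocess.run(["git", "clone", "--depth", "1", repo_url, workdir], check=True)
    # "git apply -" reads the patch from stdin
    subprocess.run(["git", "apply", "-"], input=bug_patch.encode(),
                   cwd=workdir, check=True)
    return workdir

def reward(workdir: str) -> float:
    """Verifiable reward: 1.0 if the test suite passes after the
    agent's edits, 0.0 otherwise."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=workdir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```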
The real fun begins when you consider that with every new generation of models + harnesses they get better at this, where "better" can mean better at sorting good/bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.
So the next version after that is even better, because it got more data / better data. And it keeps compounding...
This is mainly why we're seeing so many improvements, so fast: month to month now, from every ~3 months half a year ago, from every ~6 months a year ago. It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that isn't, things can still improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.
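For the non-verifiable side, a "council of LLMs" in practice often just means majority voting across several judge models. A toy sketch, assuming each judge is simply a function from (task, answer) to a verdict string (real deployments would wrap model API calls):

```python
from collections import Counter
from typing import Callable

# Each judge maps (task, answer) -> a verdict like "good" or "bad".
Judge = Callable[[str, str], str]

def council_verdict(task: str, answer: str, judges: list[Judge]) -> str:
    """Majority vote across several judge models, reducing the impact
    of any single model's idiosyncratic preferences."""
    votes = Counter(judge(task, answer) for judge in judges)
    return votes.most_common(1)[0][0]
```

Note that voting only dilutes biases the judges don't share; a bias common to all of them survives the vote, which is exactly the objection raised below.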
Judgement-based problems are still tough: LLM-as-a-judge might just bake the earlier models' biases in even deeper. Imagine if ChatGPT judged photos: anything yellow would win.
Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.
This is actually a very valid technique. We do the same (as an RL environments provider).
Except we bundle it with a custom browser renderer, which generates rewards based on DOM diffs rather than screenshots.
the browser renderer is opensource https://github.com/wootzapp/wootz-browser
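For illustration only (this is not the actual wootz-browser reward code), DOM-diff scoring could look like the sketch below. The point of the design choice: you can assert the exact structural change you expected, instead of comparing pixels that shift with fonts and rendering.

```python
from difflib import unified_diff

def dom_diff_reward(dom_before: str, dom_after: str, expected_change: str) -> float:
    """Reward 1.0 if the agent's action produced the expected structural
    change in the serialized DOM, 0.0 otherwise. A screenshot diff would
    be noisy and couldn't name *which* element changed."""
    diff = "\n".join(unified_diff(dom_before.splitlines(),
                                  dom_after.splitlines(), lineterm=""))
    return 1.0 if expected_change in diff else 0.0
```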
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
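As a concrete shape for that loop, here's a hedged sketch of wrapping an arbitrary CLI. `agent.next_command` and `score` are hypothetical stand-ins for a policy and an automatic grader, not any real library's API:

```python
import subprocess

def run_episode(agent, task: str, score: callable, max_steps: int = 10) -> float:
    """Generic RL loop over a CLI: the agent proposes shell commands,
    observes their output, and the episode is scored automatically."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = agent.next_command(transcript)  # hypothetical agent API
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return score(transcript)  # automatic reward, e.g. tests or output checks
```

Everything in that loop is cheap and mechanical except writing `score`.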
> and the quality of the result can be measured automatically
this part is nontrivial though