
Comment by sillysaurusx

4 days ago

It’s been said that RL is the worst way to train a model, except for all the others. Many prominent scientists seem to doubt that this is how we’ll be training cutting edge models in a decade. I agree, and I encourage you to try to think of alternative paradigms as you go through this course.

If that seems unlikely, remember that image generation didn’t take off till diffusion models, and GPTs didn’t take off till RLHF. If you’ve been around long enough it’ll seem obvious that this isn’t the final step. The challenge for you is, find the one that’s better.

You're assuming that people are only interested in image and text generation.

RL excels at learning control problems. Under the right conditions (a finite state and action space, and enough exploration), tabular RL is mathematically guaranteed to converge to an optimal policy for the states and controls you give it, given enough runtime. For some problems (playing computer games), that runtime is surprisingly short.
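That convergence guarantee is easy to see in miniature. Below is a toy sketch of my own (nothing from any production system): tabular Q-learning on a five-state chain where moving right toward the goal pays a reward of 1. With enough episodes, the greedy policy comes out optimal.

```python
import random

# Minimal tabular Q-learning on a toy 5-state chain MDP.
# States 0..4; action 1 moves right, action 0 moves left; reaching
# state 4 pays +1 and ends the episode. This is the setting where the
# convergence guarantees actually hold: finite states and actions,
# with every state-action pair visited often enough.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < EPS:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: q[s][x])
            s2, r, done = step(s, a)
            # standard Q-learning update toward the bootstrapped target
            target = r + (0.0 if done else GAMMA * max(q[s2]))
            q[s][a] += ALPHA * (target - q[s][a])
            s = s2
    return q

q = train()
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(N_STATES)]
print(policy)  # greedy policy: states 0-3 should all pick "right" (1)
```

The same loop only scales as far as the table does; for large state spaces you need function approximation, and the hard guarantees weaken accordingly.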

There is a reason self-driving cars use RL, and don't use GPTs.

  • > self-driving cars use RL

    Some parts of it, but I would argue with a lot of guardrails in place, and it's not as common as you think. I don't think the majority of planner/control stacks out there in SDCs are RL-based. I also don't think any production SDCs are RL-based.

  • I have been using it to train an agent on my game, hotlapdaily.

    Apparently the AI sets the best time, even better than the pros. It is really useful when it comes to controlled-environment optimization.

  • You are exactly right.

    Control theory and reinforcement learning are different ways of looking at the same problem. They have traditionally and culturally focused on different aspects.

RL is still widely used in the advertising industry. Don't let anyone tell you otherwise. When you have millions to billions of visits and you are trying to optimize an outcome, RL is very good at that. Add in context with contextual multi-armed bandits and you have something very good at driving people toward purchasing.
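As a rough illustration of that bandit approach, here is a minimal epsilon-greedy contextual bandit sketch. The segments, ad names, and conversion rates are all made up for illustration; a real ad system would more likely use Thompson sampling or LinUCB, but the explore/exploit loop has the same shape.

```python
import random

# Hypothetical epsilon-greedy contextual bandit: contexts are user
# segments, arms are ad creatives, and the reward is a simulated
# purchase drawn from a segment-dependent conversion rate.
SEGMENTS = ["new_visitor", "returning"]
ARMS = ["ad_a", "ad_b"]
TRUE_RATE = {  # invented ground-truth conversion rates
    ("new_visitor", "ad_a"): 0.02, ("new_visitor", "ad_b"): 0.06,
    ("returning", "ad_a"): 0.10, ("returning", "ad_b"): 0.04,
}

def run(steps=50000, eps=0.1, seed=1):
    rng = random.Random(seed)
    pulls = {k: 0 for k in TRUE_RATE}
    wins = {k: 0 for k in TRUE_RATE}
    for _ in range(steps):
        ctx = rng.choice(SEGMENTS)
        if rng.random() < eps:
            arm = rng.choice(ARMS)  # explore
        else:  # exploit: best observed conversion rate for this context
            arm = max(ARMS, key=lambda a: wins[(ctx, a)] / pulls[(ctx, a)]
                      if pulls[(ctx, a)] else 0.0)
        pulls[(ctx, arm)] += 1
        wins[(ctx, arm)] += rng.random() < TRUE_RATE[(ctx, arm)]
    # report the learned best arm per context
    return {ctx: max(ARMS, key=lambda a: wins[(ctx, a)] / max(pulls[(ctx, a)], 1))
            for ctx in SEGMENTS}

print(run())
```

With even a modest exploration rate, each (segment, arm) pair keeps accumulating data, so the per-context best arm emerges from the observed rates rather than being guessed up front.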

What about combinatorial optimization? When you have a simulation of the world, what other paradigms fit?

  • More likely we will develop general super intelligent AI before we (together with our super intelligent friends) solve the problem of combinatorial optimization.

    • There's nothing to solve. The curse of dimensionality kills you no matter what. P=NP, or maybe quantum computing, is the only hope of making serious progress on large-scale combinatorial optimization.

I like to think of RLHF as the technique I used as a student to score good marks on my exams. As soon as I started working, I realized that out-of-distribution generalization can't be achieved only by practicing in an environment with verifiable rewards.

GPT wouldn't even have been possible, let alone taken off, without self-supervised learning.

  • RLHF is what gave us the ChatGPT moment. Self-supervised learning was the base for this.

    SSL creates all the connections and RL learns to walk the paths

    • The easy-to-use web interface gave us the ChatGPT moment. Take a look at AI Dungeon for GPT-2: it went viral because it made GPT-2 accessible.
