Comment by sillysaurusx
4 days ago
It’s been said that RL is the worst way to train a model, except for all the others. Many prominent scientists seem to doubt that this is how we’ll be training cutting edge models in a decade. I agree, and I encourage you to try to think of alternative paradigms as you go through this course.
If that seems unlikely, remember that image generation didn't take off till diffusion models, and GPTs didn't take off till RLHF. If you've been around long enough, it'll seem obvious that this isn't the final step. The challenge for you is to find the one that's better.
You're assuming that people are only interested in image and text generation.
RL excels at control problems. Given enough runtime, it is mathematically guaranteed to converge to an optimal policy for the states and controls you specify. For some problems (playing computer games), that runtime is surprisingly short.
There is a reason self-driving cars use RL, and don't use GPTs.
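The "guaranteed optimal given enough runtime" claim above is the tabular Q-learning convergence result. A minimal sketch on a made-up 5-state chain MDP (the environment, reward, and hyperparameters here are all invented for illustration, not any real benchmark):

```python
# Tabular Q-learning on a toy chain: states 0..4, reward 1 for reaching
# state 4. With persistent exploration and enough episodes, the greedy
# policy converges to the optimal one (always move right).
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = [-1, +1]    # move left or right
GAMMA = 0.9           # discount factor
ALPHA = 0.1           # learning rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

random.seed(0)
for _ in range(2000):
    s, done = 0, False
    while not done:
        # epsilon-greedy exploration keeps every (s, a) pair visited,
        # one of the conditions the convergence proof needs
        if random.random() < 0.2:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # greedy policy: move right in every state
```

The guarantee is specific to this tabular setting with decaying exploration; with function approximation (deep RL), it no longer holds in general.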
> self-driving cars use RL
Some part of it, but I would argue with a lot of guardrails in place, and it's not as common as you think. I don't think the majority of the planner/control stack out there in SDCs is RL-based. I also don't think any production SDCs are fully RL-based.
Based on the zoox iccv talk, it sounds like their main planner is RL.
I have been using it to train an agent on my game, hotlapdaily.
Apparently the AI sets the best time, even better than the pros. It is really useful when it comes to controlled-environment optimization.
You are exactly right.
Control theory and reinforcement learning are different ways of looking at the same problem; they have traditionally and culturally focused on different aspects.
RL is barely even a training method; it's more of a dataset-generation method.
I feel like both this comment and the parent highlight how RL has been going through a cycle of misunderstanding recently, driven by another of its popularity booms from being used to train LLMs.
care to correct the misunderstanding?
It's reductive, but also roughly correct.
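The "dataset generation" framing above can be sketched with REINFORCE on a toy two-armed bandit: each iteration *generates* a dataset of (action, reward) pairs with the current policy, then runs an ordinary weighted log-likelihood (supervised) step on that dataset. The arms, reward rates, and learning rate here are all invented for illustration:

```python
# REINFORCE viewed as: generate data with the policy, then do weighted
# maximum-likelihood on what you generated. Softmax policy, one logit
# per arm; arm 1 pays off more often, so its probability should grow.
import math
import random

random.seed(0)
logits = [0.0, 0.0]
TRUE_REWARD = [0.2, 0.8]   # hypothetical payout rates; arm 1 is better
LR = 0.5

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(200):
    probs = softmax(logits)
    # 1) "dataset generation": sample a batch of (action, reward) pairs
    batch = []
    for _ in range(32):
        a = 0 if random.random() < probs[0] else 1
        r = 1.0 if random.random() < TRUE_REWARD[a] else 0.0
        batch.append((a, r))
    baseline = sum(r for _, r in batch) / len(batch)
    # 2) "training": a weighted log-likelihood step on that dataset
    #    (grad of log pi(a) weighted by reward minus baseline)
    grad = [0.0, 0.0]
    for a, r in batch:
        w = r - baseline
        for i in range(2):
            indicator = 1.0 if i == a else 0.0
            grad[i] += w * (indicator - probs[i]) / len(batch)
    logits = [z + LR * g for z, g in zip(logits, grad)]

print(softmax(logits))  # probability mass concentrates on arm 1
```

Step 2 is literally a supervised update on the batch from step 1, which is what makes the "dataset generation" framing reductive but roughly correct.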
RL is still widely used in the advertising industry. Don't let anyone tell you otherwise. When you have millions to billions of visits and you are trying to optimize an outcome, RL is very good at that. Add in context with contextual multi-armed bandits and you have something very good at driving people towards purchasing.
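A minimal sketch of the contextual-bandit loop described above, using epsilon-greedy with per-(context, arm) running averages. The contexts, ad variants, and conversion rates are all hypothetical:

```python
# Contextual bandit for ad selection: pick an ad variant per visitor
# context, observe a purchase/no-purchase reward, update the estimate.
import random

random.seed(0)
CONTEXTS = ["mobile", "desktop"]
ARMS = ["ad_a", "ad_b"]
# hypothetical true conversion rates, unknown to the learner
TRUE_RATE = {("mobile", "ad_a"): 0.05, ("mobile", "ad_b"): 0.12,
             ("desktop", "ad_a"): 0.10, ("desktop", "ad_b"): 0.04}

counts = {(c, a): 0 for c in CONTEXTS for a in ARMS}
values = {(c, a): 0.0 for c in CONTEXTS for a in ARMS}

def choose(ctx, eps=0.1):
    if random.random() < eps:
        return random.choice(ARMS)                     # explore
    return max(ARMS, key=lambda a: values[(ctx, a)])   # exploit

for _ in range(50_000):                                # simulated visits
    ctx = random.choice(CONTEXTS)
    arm = choose(ctx)
    reward = 1.0 if random.random() < TRUE_RATE[(ctx, arm)] else 0.0
    counts[(ctx, arm)] += 1
    # incremental running mean of the observed conversion rate
    values[(ctx, arm)] += (reward - values[(ctx, arm)]) / counts[(ctx, arm)]

best = {c: max(ARMS, key=lambda a: values[(c, a)]) for c in CONTEXTS}
print(best)  # learns a different best ad per context
```

Production systems typically use smarter exploration (UCB, Thompson sampling) and feature-based models like LinUCB rather than a lookup table, but the loop is the same: serve, observe, update.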
What about combinatorial optimization? When you have a simulation of the world, what other paradigms fit?
More likely we will develop general super intelligent AI before we (together with our super intelligent friends) solve the problem of combinatorial optimization.
There's nothing to solve. The curse of dimensionality kills you no matter what. P=NP, or maybe quantum computing, is the only hope of making serious progress on large-scale combinatorial optimization.
I like to think of RLHF as a technique I used as a student to score good marks on my exams. As soon as I started working, I realized that out-of-distribution generalization can't be achieved only by practicing in an environment with verifiable rewards.
GPT wouldn't even have been possible, let alone taken off, without self-supervised learning.
RLHF is what gave us the ChatGPT moment. Self-supervised learning was the base for it.
SSL creates all the connections, and RL learns to walk the paths.
The easy-to-use web interface gave us the ChatGPT moment. Take a look at AI Dungeon for GPT-2: it went viral by making GPT-2 accessible.