Comment by hodapp
5 months ago
You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.