Comment by jedberg
4 days ago
WOPR used reinforcement learning, and could learn from its simulated mistakes. LLMs can't do that without some sort of RL harness. :)