Comment by kimjune01
5 hours ago
Although Arena is adversarial and resistant to Goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthfulness.