Comment by nopinsight
3 hours ago
From Claude 4.6 Thinking:
OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed that the eval scripts reliably score success and failure; the full set has some noisy grading, where correct actions can still get marked wrong.
Scores on Verified tend to run higher, so results on the two sets aren't directly comparable.