Comment by sfink
7 hours ago
The report I saw kind of seemed to be pointing at a flaw and asking "do you see it?" which is not the same thing. I felt a pretty large difference between Opus 4.6's results and Mythos's, so I would be surprised if even weaker models did anywhere near as well. I'd like to see these results, if they are using a decent methodology.
Of course, even the reports with flawed methodology could be suggesting that a great harness + weak model might achieve a similar level of results as a mediocre harness + strong model. But I'd want to see solid evidence for that.
No comments yet
Contribute on Hacker News ↗