Comment by pshirshov
3 days ago
Yes, I know all the flaws. As I said, it's not an objective way to measure the performance of a model, but it is intended to produce something that only humans can measure. The goal is for you to be able to play the game and judge, and to fill in the human checklist for yourself if you wish.
You didn't get why the automatic review scores are there: all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion, that is a sort of empirical evidence that these models are very far from the "AGI" state.
Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all.
Can you bring something clearly showcasing Fable's superiority?