Comment by pshirshov

3 days ago

Yes, I know all the flaws. As I said, it's not an objective way to measure the performance of a model, but it is intended to produce something that only humans can measure. The goal is for you to be able to play the game and judge, and to fill in the human checklist for yourself if you wish.

You didn't get why the automatic review scores are there: all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion, that is a sort of empirical evidence that these models are very far from the "AGI" state.

Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all.