Comment by enraged_camel

4 days ago

>> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I don't care if you're hyped or not. You asked if the posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.

But can you bring anything measurable in support of your words? I did.

  • You brought your own benchmark to support your words. I happen to have studied statistics, so I took a look. It is deeply flawed, primarily because it is not a statistical benchmark. It is a single (n=1) autonomous "pi" coding-harness run per model per prompt, scored by an automated battery (A-items, pass/fail), an LLM code review (R-items, 0 to 2 each), and a human manual checklist (M1 to M10) that was never actually completed.

    The grader being an LLM is a big problem. You yourself admit explicitly that the grader is the same model family as the Fable 5 contestant cell and say to "discount accordingly, or re-grade with a non-Claude judge."

    Model configurations appear to not be uniform either. Effort levels differ (mimo-v2.5-pro at @high, everyone else at @xhigh), harnesses differ (codex internal config vs. pi vs. claude -p), context windows differ, and one model (GPT-5.5) had extra MCP tools the others did not.

    The two scored runs seem to use two different rubrics (/22 then /25), so scores are not comparable across runs, and the /22 rubric saturated (there are multiple 22/22 results).

    A provider quota error (HTTP 429) truncated the minimax-m3 run mid-build, but it was still scored (18/25) and ranked, on code that does not compile and has zero tests.

    If you want actual benchmarks, there are dozens of legitimate ones out there. Many of them have been posted on this website. They overwhelmingly disagree with yours. If you have any interest whatsoever in creating a reliable benchmark (so that you can make optimal decisions on what models to use for your work), you should look at them and see how yours needs to be redesigned.
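
    To make the n=1 point concrete, here is a minimal sketch of what repeated runs buy you: a percentile bootstrap interval for each model's mean score. The scores and model names below are made up for illustration (they are not from your benchmark), and `bootstrap_mean_ci` is a hypothetical helper, not part of any harness.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if len(scores) < 2:
        # A single run carries no information about run-to-run variance.
        raise ValueError("need at least two runs to estimate an interval")
    rng = random.Random(seed)
    # Resample the runs with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical scores out of 25 for five repeated runs of the same prompt.
runs = {
    "model_a": [18, 22, 20, 17, 23],
    "model_b": [19, 21, 20, 20, 22],
}

for name, scores in runs.items():
    lower, upper = bootstrap_mean_ci(scores)
    print(f"{name}: mean={statistics.fmean(scores):.1f} "
          f"95% CI=({lower:.1f}, {upper:.1f})")
```

    If the intervals for two models overlap heavily, a ranking between them is not supported by the data. With one run per cell, the helper refuses to produce an interval at all, which is the situation your benchmark is in.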

    • Yes, I know all the flaws. As I said, it's not an objective way to measure the performance of a model, but it is intended to produce something that only humans could measure. The goal is for you to be able to play the game and judge, and to fill in the human checklist for yourself if you wish.

      You didn't get why the automatic review scores are there: all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion that is a sort of empirical evidence that these models are very far from the "AGI" state.

      Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all.

      Can you bring something clearly showcasing Fable's superiority?

  • The OP and GP need all genai news to be positive to the point of using doublespeak here unironically.

    "Relentlessly proactive" is a grotesque use of language. A paperclip optimizer is "relentlessly proactive".

    We already had a word for what is being promoted here: wasteful.