Comment by pshirshov

4 days ago

Ok, explain one thing to me: I have a benchmark where I feed an identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.

That describes all my tests with Fable.

Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?

>> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I don't care if you're hyped or not. You asked if the posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.

  • But can you bring anything measurable in support of your words? I did.

    • You brought your own benchmark to support your words. I happen to have studied statistics, so I took a look. It is deeply flawed, primarily because it is not a statistical benchmark. It is a single (n=1) autonomous "pi" coding-harness run per model per prompt, scored by an automated battery (A-items, pass/fail), an LLM code review (R-items, 0 to 2 each), and a human manual checklist (M1 to M10) that was never actually completed.

      The grader being an LLM is a big problem. You yourself admit explicitly that the grader is the same model family as the Fable 5 contestant cell and say to "discount accordingly, or re-grade with a non-Claude judge."

      Model configurations appear to not be uniform either. Effort levels differ (mimo-v2.5-pro at @high, everyone else at @xhigh), harnesses differ (codex internal config vs. pi vs. claude -p), context windows differ, and one model (GPT-5.5) had extra MCP tools the others did not.

      The two scored runs seem to use two different rubrics (/22 then /25), so scores are not comparable across runs, and the /22 rubric saturated (there are multiple 22/22 results).

      A provider quota error (HTTP 429) truncated the minimax-m3 run mid-build, but it was still scored (18/25) and ranked, on code that does not compile and has zero tests.

      If you want actual benchmarks, there are dozens of legitimate ones out there. Many of them have been posted on this website. They overwhelmingly disagree with yours. If you have any interest whatsoever in creating a reliable benchmark (so that you can make optimal decisions on what models to use for your work), you should look at them and see how yours needs to be redesigned.
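
      To illustrate the n=1 point, here is a minimal sketch that assumes each run is reduced to a single pass/fail outcome, which is a simplification of the rubric discussed above. It computes a Wilson score interval for a model's pass rate; the function name and the run counts are hypothetical and are not taken from the benchmark.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage).

    Hypothetical helper for illustration; not part of the benchmark under discussion.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# One passing run out of one: the interval covers most of [0, 1].
print(wilson_interval(1, 1))
# Twenty passing runs out of twenty: the lower bound is far tighter.
print(wilson_interval(20, 20))
```

      With a single run, a pass is consistent with a true pass rate anywhere from about 0.2 to 1.0, so a ranking built on one run per model says very little about which model is better.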

    • The OP and GP need all genai news to be positive to the point of using doublespeak here unironically.

      "Relentlessly proactive" is a grotesque use of language. A paperclip optimizer is "relentlessly proactive".

      We already had a word for what is being promoted here: wasteful.