Comment by merlindru
3 days ago
i tried out gpt 5.4 xhigh and it did meaningfully worse than opus 4.6 with the same prompt. like, obvious mistakes
I've been pretty satisfied using oh-my-openagent (omo) on opencode with both opus-4.6 and gpt-5.4 lately. The author of omo suggests different prompting strategies for different models and goes into some detail here: https://github.com/code-yeongyu/oh-my-openagent/blob/dev/doc... For each agent they define, the prompt changes depending on which model is being used, to fit that model. I wonder how many of the "x did worse than y with the same prompt" results would improve if the prompts were actually tailored to what each model is good at. I also wonder if any of this matters or if it's all a crock of bologna.
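For what it's worth, the per-model tailoring idea is simple to sketch. This is a hypothetical illustration, not omo's actual API; the variant strings and the `tailor_prompt` helper are made up:

```python
# Hypothetical per-model prompt tailoring: pick a system-prompt variant
# based on a substring of the model id. The variants below are invented
# examples, not omo's real prompts.
PROMPT_VARIANTS = {
    "opus": "Be terse. Plan first, then make edits directly.",
    "gpt": "Think step by step. Ask before large refactors.",
}

def tailor_prompt(model_id: str, task: str) -> str:
    """Prepend the variant whose key appears in the model id; otherwise return the task unchanged."""
    for key, style in PROMPT_VARIANTS.items():
        if key in model_id:
            return f"{style}\n\n{task}"
    return task

print(tailor_prompt("claude-opus-4.6", "Fix the failing test."))
print(tailor_prompt("gpt-5.4-xhigh", "Fix the failing test."))
```

The same task string gets a different preamble per model, which is roughly what omo's per-agent, per-model prompts do at a larger scale.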
i think it may matter a good bit. i definitely have to write in different styles with different models (and catch myself doing so unintentionally) now that you mention it...
definitely not bologna, at least anecdotally :)
Fwiw I run this eval every week on a set of known prompts, and I believe the in-group differences are bigger than the out-group ones.
That is, I get more variance between opus 4.6 and itself than I do between the SOTA models.
I don't have the budget for statistical significance, but I'm convinced people claiming broad differences are just vibing, or that there are cases where agent features make a big difference.
it may be the agent features in my case. now that i think about it, i also forgot that my CLAUDE.md is different from my AGENTS.md
either way, all that one can really rely on is the benchmarks, and those are easily cheated/overfitted to.
i think it's all very hard to quantify, so take my previous comment with a massive rock of salt