Comment by p4coder
8 hours ago
Sometimes getting a second pair of eyes to look at the problem helps, and it's usually not a judgement on the smartness of the first pair of eyes. Seems like this also applies to coding agents.
Indeed, I've also found that various models are good at various tasks, but I haven't yet been able to categorize "Model X is good at Y-class of bugs", so I end up using N models for a first pass ("find the root cause of this issue"), then once it's found, passing it along to the same N models for them to attempt a fix (rough sketch below).
So far, which model can find or solve what is scattered all over the place.
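For anyone wanting to try this, a rough Python sketch of that fan-out-then-fix loop. call_model() is a placeholder for whatever client you use, and the model names are made up:

    # Two-pass, multi-model workflow: every model diagnoses independently,
    # then every model gets the chosen diagnosis back and attempts a fix.
    from concurrent.futures import ThreadPoolExecutor

    MODELS = ["model-a", "model-b", "model-c"]  # your N models

    def call_model(model: str, prompt: str) -> str:
        # Placeholder: swap in your real API client here.
        return f"[{model}] answer to: {prompt[:40]}..."

    def fan_out(prompt: str) -> dict:
        # Send the same prompt to every model in parallel.
        with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
            futures = {m: pool.submit(call_model, m, prompt) for m in MODELS}
            return {m: f.result() for m, f in futures.items()}

    bug_report = "Steps to reproduce: ..."

    # Pass 1: each model proposes a root cause independently.
    diagnoses = fan_out("Find the root cause of this issue:\n" + bug_report)

    # Pick one diagnosis (crudely, the longest; in practice, review by hand)
    # and hand it back to the same models for a fix attempt.
    chosen = max(diagnoses.values(), key=len)
    fixes = fan_out("Given this root cause, propose a fix:\n" + chosen)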
You are experiencing the jagged skills frontier. All models have these weird skill gaps and prompt phrasing sensitivity. This is the main problem solved by an llm-consortium. It's expensive running multiple models in parallel for the same prompt, but the time saved is worth it for gnarly problems. It fills in the gaps between models to tame the jagged frontier.
My very first use of the llm-consortium saw me feeding in its own source code to look for bugs. It surfaced a serious bug which only one of the three models had spotted. Lots of problems are NP-ish, so parallel sampling works really well. Google's IMO gold and OpenAI's IOI gold both used some form of parallel reasoning.
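To make the "NP-ish" point concrete, a hedged sketch of parallel sampling: generation is expensive but checking is cheap, so you sample several models at once and keep whatever verifies. sample() and verify() are stand-ins for a real generation call and a real checker (tests, type checker, etc.):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def sample(model: str, prompt: str) -> str:
        return f"candidate patch from {model}"  # placeholder generation

    def verify(candidate: str) -> bool:
        return "patch" in candidate             # placeholder cheap check

    def first_verified(models, prompt):
        # Fan the prompt out, then accept the first candidate that passes
        # the cheap check; discard the rest.
        with ThreadPoolExecutor(max_workers=len(models)) as pool:
            futures = [pool.submit(sample, m, prompt) for m in models]
            for fut in as_completed(futures):
                candidate = fut.result()
                if verify(candidate):
                    return candidate
        return None

    print(first_verified(["model-a", "model-b", "model-c"], "fix the flaky test"))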
This is so true. Another thing: a model might be better at something in general, but worse once the context gets too long. Given how GLM-4.5 was trained, largely on short contexts, this may be the case for it.
GPT-5: Exceptional at abstract reasoning, planning, and following the intention behind instructions. Concise and intentional. Not great at manipulating text or generating Python code.
Gemini 2.5 Pro: Exceptional at manipulating text and Python, not great at abstract reasoning. Verbose. Doesn't follow instructions well.
Another thing I've learned is that models work better when they work on code that they themselves generated. It's "in distribution" and more comprehensible to them.
The good old regression to the mean. Testing models as the second pair of eyes only when the first fails is going to give weird results... https://www.smbc-comics.com/comic/protocol