simonw 20 hours ago

The leap from GPT-4 to GPT-5.5 has been astounding in my opinion. There is no way GPT-4 could run a coding agent harness like Codex at even a fraction of the quality that GPT-5.5 does.

anon373839 19 hours ago

I don’t think that’s exactly indicative of GPT-5.5 being an astoundingly more intelligent model, however. An alternate interpretation is that GPT-5.5 was trained on tool usage/harness patterns and has been optimized for this use case.

I remember that even when GPT-4 was king, the Gorilla paper showed that Llama 7B could be fine-tuned to outperform GPT-4 on tool calling.

On domains that don’t involve agentic tool calling*, I haven’t found the frontier to have advanced that much.

Edit: I should broaden this to domains that naturally lend themselves to RLVR (reinforcement learning with verifiable rewards) training. Models are drastically better at math now.

baq 7 hours ago

None of this matters in the product: the model either is capable of agentic loop workflows or it isn’t. A 10% improvement in the probability of single-task success makes or breaks the use case.
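
To put rough numbers on why a small per-task gain can matter so much, here is a minimal sketch. It assumes an agentic loop is a chain of independent steps that must all succeed, which is a simplification, not something measured in this thread. The function name `loop_success`, the 20-step chain, and the 0.80 vs. 0.90 success rates are all illustrative choices.

```python
def loop_success(p_step: float, n_steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (hypothetical model): steps are independent and the
    loop has no retries, so the probabilities simply multiply.
    """
    return p_step ** n_steps


# Compare two illustrative per-step success rates over a 20-step loop.
for p in (0.80, 0.90):
    print(f"per-step {p:.2f} -> whole loop {loop_success(p, 20):.4f}")
```

Under these assumptions, a 10-point gain per step changes the whole-loop success rate by roughly a factor of ten. Real harnesses retry and recover from failed steps, so the actual gap will differ.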