Comment by avereveard

20 hours ago

"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, and the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness produces a smaller benchmark delta than swapping the harness around the same model. the cheating-agents post you linked is the same observation through a different lens: the harness is what's being measured, the model is just the substrate.

that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, just as tools obsoleted RAG context injection from question embeddings.

That's why ARC-AGI-3 doesn't allow the use of a harness. The model has to create the harness instead.

  • Seems completely backwards to me. This is like judging Formula 1 just by the raw power of the engine. The rest of the car has just as much engineering, if not more.

  • The model is not allowed to create a harness either, I think.

    • it can, it just has to be within the same 'session'. but it's mostly limited to scratch notes afaik, since there's no python or bash. and yeah, if there's no way to execute code there's no real way to build a harness.