Comment by avereveard

20 hours ago

"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, and the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens: the harness is what's being measured, the model is just the substrate.

that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, just as tools obsoleted RAG context injection from question embeddings.

himata4113 20 hours ago

That's why ARC-AGI-3 doesn't allow the use of harnesses. The model has to create the harness instead.

grzracz 17 hours ago
Seems completely backwards to me. This is like judging Formula 1 just by the raw power of the engine. The rest of the car has just as much engineering, if not more.
- wyre 16 hours ago
  
  ARC-AGI is testing raw intelligence, like the raw power of a Formula 1 engine. The rest of the car is the harness.
  
vova_hn2 17 hours ago
The model is not allowed to create a harness either, I think.
- himata4113 15 hours ago
  
  it can, it just has to be within the same 'session'. but afaik it's mostly limited to scratch notes since there's no Python or bash, so yeah, if there's no way to execute code there's no real way to build a harness.