Comment by daveguy

15 hours ago

It may have been tested on the full set, but the score you quote is for a single game environment, not the full public set. That fact is stated verbatim in the text you responded to and that vbarrielle quoted. It scored 97% in one game and 0% in another. The full prelude to what vbarrielle quoted, the last sentence of which you left out, was:

> We then tested the harnesses on the full public set (which researchers did not have access to at the time). We found extreme bimodal performance across the two sets, controlling for the same frontier model...

The harness only transfers to similar environments, and the intelligence for those specific games is baked into the harness by the humans who coded it for this specific challenge.

The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments. Having a human give it more powerful tools in a harness defeats the purpose. You should go back and read the original ARC-AGI paper to see what this is about+. Are you upset about the benchmark because frontier LLMs do so poorly at generalizing when the benchmarks are released?

+ https://arxiv.org/abs/1911.01547

fc417fc802 14 hours ago

> intelligence for those specific games is baked into the harness

This is your claim, but the other commenter claims the harness consists only of generic tools. What's the reality?

I also encountered confusion about this exact issue in another subthread. I had thought that generic tooling was allowed, but others believed the benchmark to be limited to ingesting the raw text directly from the API, without access to any agent environment, however generic it might be.

daveguy 1 hour ago

1) Pointing out which tools to use is itself part of the intelligence that LLMs aren't great at.
2) One of the tools is a pathfinding algorithm, a big improvement/crutch over a regular LLM that has no such capability.
You'd think that if LLMs were intelligent, they'd be able to determine that a pathfinding algorithm is necessary and have a sub-agent code it up real quick. But apparently they just can't do that without humans stepping in to make it a standard tool for them.
Here's the paper on what they did for the Duke Harness:
https://blog.alexisfox.dev/arcagi3
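
For a sense of what "a pathfinding tool" means in practice, here's a minimal sketch of a generic grid pathfinder. To be clear, this is not taken from the Duke harness. The function name `shortest_path`, the 0/1 grid encoding, and the 4-way movement are all my own assumptions for illustration; it's plain breadth-first search.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 2D grid (hypothetical helper, not the harness's tool).

    grid: list of rows, where 0 is a free cell and 1 is a wall (assumed encoding).
    start, goal: (row, col) tuples; start is assumed to be a free cell.
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])

    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]

        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))

    return None  # goal not reachable from start
```

That's a couple dozen lines. The point stands that nobody had to hand this to a human solving the same games.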