That's impressive. I'm also a bit surprised - I wouldn't have expected it to be trained much at all on that sort of visual input task. I think I'd be similarly surprised to learn that a frontier model was particularly good at playing retro videogames or actuating a robot for example.
However, if it can't figure out how to render the JSON into a visual on its own, does it really qualify as AGI? I'd still say the benchmark is doing its job here. Granted, it's not a perfectly even playing field in that case, but I think the goal is to test for progress towards AGI rather than to host a fair tournament.
> However, if it can't figure out how to render the JSON into a visual on its own, does it really qualify as AGI? I'd still say the benchmark is doing its job here.
Can you render a serialized JSON text blob into a visual with your brain alone? The model can't do any better than that: "no harness" means no tools at all, no way to, say, implement a visualizer in some programming language and run it.
Why don't human testers receive the same JSON text blob with no visualizer? As it stands, human testers get a harness (a playable visualizer) while the model is deliberately denied one.
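To make the point concrete, here is a minimal sketch of the kind of visualizer the model is barred from writing for itself. The grid schema here is a hypothetical stand-in (a 2D list of small integers under a "grid" key); the actual ARC-AGI-3 frame format may differ.

```python
import json

def render(blob: str) -> str:
    """Render a serialized JSON grid as ASCII art, one glyph per cell.

    Assumes a hypothetical payload like {"grid": [[0, 1], [1, 0]]};
    the real benchmark's frame schema may be structured differently.
    """
    grid = json.loads(blob)["grid"]
    glyphs = ".#*o+x@%$&"  # one glyph per enumeration value
    return "\n".join("".join(glyphs[cell] for cell in row) for row in grid)

print(render('{"grid": [[0, 1, 0], [1, 2, 1]]}'))
```

A dozen lines like this is all "a visualizer in whatever programming language" would take, which is exactly why forbidding tool use changes the task so much.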
Huh. I thought it wasn't supposed to receive any instructions tailored to the task, but I didn't understand it to be restricted from accessing truly general tools such as programming languages. To do otherwise is to require pointless hoop-jumping: frontier models will inevitably be retrained to play games using a JSON (or other arbitrary) representation, at which point that format will be natural for them and the real test will begin.
Source? I haven't seen anything like that for ARC-AGI performance.
Also, if it makes that big of a difference, then build a renderer for your agent that looks like the web page, have it solve the puzzles in the graphical interface, and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" that the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumerations in JSON. If anything, that makes it a more difficult problem for an AI system that "speaks" in tokens.
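The equivalence claimed above can be sketched in a few lines: any injective palette makes the enumeration matrix and the pixel matrix interchangeable, so neither representation carries more information than the other. The palette below is an illustrative assumption, not the benchmark's actual color map.

```python
# Hypothetical palette mapping enumeration values to RGB pixels;
# any injective mapping makes the two representations interchangeable.
PALETTE = {0: (0, 0, 0), 1: (255, 0, 0), 2: (0, 0, 255)}
INVERSE = {rgb: v for v, rgb in PALETTE.items()}

def to_pixels(grid):
    """Enumeration matrix -> pixel (RGB) matrix."""
    return [[PALETTE[v] for v in row] for row in grid]

def to_enums(pixels):
    """Pixel matrix -> enumeration matrix (lossless round trip)."""
    return [[INVERSE[p] for p in row] for row in pixels]

g = [[0, 1], [2, 0]]
assert to_enums(to_pixels(g)) == g  # same information either way
```

The round trip is lossless in both directions, which is the sense in which "rendering" adds nothing the model didn't already have.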
That score is in the ARC technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, and bash tools).
This is already a solved benchmark. That's why the scoring is so convoluted and a self-proclaimed agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothingburger of a benchmark, but this takes the cake.
> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configurations
This is with a harness that was designed to tackle "a small set of public environments: ls20, ft09, and vc33" (from the ARC-AGI-3 challenge), yet it looks like it does not solve the full ARC-AGI-3 benchmark, just some of its environments.
[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
[2] https://blog.alexisfox.dev/arcagi3