Comment by fc417fc802

1 day ago

So what? Are you suggesting that an agent exhibiting genuine AGI will be tripped up by having to ingest json rather than rgb pixels? LLMs are largely trained on textual data so json is going to be much closer to whatever native is for them.

But by all means, give the agents access to an API that returns pixel data. However I fully expect that would reduce performance rather than increase it.

11 comments

fc417fc802

pawelk411 20 hours ago

Because it is. Opus 4.6 jumps from 0.0% to 97.1% when given visual input

fc417fc802 19 hours ago
That's impressive. I'm also a bit surprised - I wouldn't have expected it to be trained much at all on that sort of visual input task. I think I'd be similarly surprised to learn that a frontier model was particularly good at playing retro videogames or actuating a robot for example.
However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here. Granted it's not a perfectly even playing field in that case but I think the goal is to test for progress towards AGI as opposed to hosting a fair tournament.
- rfoo 19 hours ago
  
  > However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here.
  Can you render serialized JSON text blob to a visual with your brain only? The model can't do anything better than this - no harness means no tool at all, no way to e.g. implement a visualizer in whatever programming language and run it.
  Why don't human testers receive the same JSON text blob and no visualizers? It's like giving human testers a harness (a playable visualizer), but deliberately cripples it for the model.
  
  2 replies →
daveguy 12 hours ago
Source? I haven't seen anything like that for ARC-AGI performance.
Also, if it makes that big of a difference, then make a renderer for your agent that looks like the web page and have it solve them in the graphical interface and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumeration in JSON. If anything, that makes it a more difficult problem for a AI system that "speaks" in tokens.
- famouswaffles 10 hours ago
  
  That score is in the arc technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, bash tools).
  This is already a solved benchmark. That's why scoring is so convoluted and a self proclaimed Agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothing burger of a benchmark but this takes the cake.
  [1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
  [2] https://blog.alexisfox.dev/arcagi3
  
  4 replies →