Comment by bogtog

7 days ago

I think the top performer afaik (ChatGPT o3) is still treating ARC as a series of characters. I imagine complex reasoning in multimodal processing wouldn't be nearly as advanced so treating it as characters is still better

Interesting, I thought one of the whole points of o3 was mixed multimodal reasoning (e.g. everyone doing those GeoGuessr challenges). But maybe that's just a parlor trick and it's not actually implemented that way. I wonder when they're going to extend chain-of-thought to work with image tokens. Seems like that'd help for solving spatial challenges like this.

  • I can't speak to whether it is a parlor trick, but my gut is that processing a 30x30 grid isn't really representative of o3's image processing. This tiny grid isn't like any image it would encounter normally and is so short that the benefits of language processing outweigh the downsides.

    I expect that for much larger images (e.g., 300x300 grids) and for problems simpler than ARC, o3's image processing would give it a lead over o3 processing a very long character stream.
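
For a rough sense of the scale difference, here's a minimal sketch that serializes a square grid as text and counts the characters. The format (one digit per cell, one line per row) and the names `grid_to_text` and `serialized_length` are my own assumptions, not how o3 or any ARC harness actually encodes grids, and real token counts depend on the tokenizer.

```python
def grid_to_text(grid):
    """Serialize a grid of single-digit cell values, one text line per row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def serialized_length(side):
    """Character count of a side x side grid in the assumed text format."""
    grid = [[0] * side for _ in range(side)]  # placeholder values, all zeros
    return len(grid_to_text(grid))


small = serialized_length(30)    # 30x30 grid, the ARC maximum mentioned above
large = serialized_length(300)   # hypothetical 300x300 grid
print(small, large, large / small)
```

Under these assumptions the character stream grows with the square of the side length, so the 300x300 case is roughly 100 times longer than the 30x30 one.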