Comment by saberience

6 days ago

I’m not sure the ARC-AGI tasks are interesting benchmarks: for one, they are image based, and for two, most people I show them to have issues understanding them, and in fact I had issues understanding them.

Given the models don’t even see the versions we get to see, it doesn’t surprise me they have issues with these. It’s not hard to make benchmarks so hard that neither humans nor LLMs can do them.

"most people I show them too have issues understanding them, and in fact I had issues understanding them" ??? those benchmarks are so extremely simple they have basically 100% human approval rates, unless you are saying "I could not grasp it immediately but later I was able to after understanding the point" I think you and your friends should see a neurologist. And I'm not mocking you, I mean seriously, those are tasks extremely basic for any human brain and even some other mammals to do.

  • > so extremely simple they have basically 100% human approval rates

    Are you thinking of a different set? ARC-AGI-2 has an average 60% success rate for a single person, and a question requires only 2 correct answers out of 9 to be accepted (see the rough binomial sketch below). https://docs.google.com/presentation/d/1hQrGh5YI6MK3PalQYSQs...

    > and even some other mammals to do.

    No, that's not the case.
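
    A rough back-of-the-envelope on what that acceptance bar implies (my own sketch, assuming "2 out of 9" means at least two of nine independent testers solving the task, and a uniform per-person solve rate, which the real per-task data certainly isn't):

    ```python
    from math import comb

    # Hypothetical sketch: probability that at least `needed` of `testers`
    # independent people solve a task, if each solves it with probability p.
    # The 60% figure is the per-person average cited above; real per-task
    # rates vary widely, so this only illustrates the scoring mechanics.
    def p_accepted(p: float, testers: int = 9, needed: int = 2) -> float:
        return sum(
            comb(testers, k) * p**k * (1 - p) ** (testers - k)
            for k in range(needed, testers + 1)
        )

    for p in (0.2, 0.4, 0.6):
        print(f"per-person rate {p:.0%} -> accepted with prob {p_accepted(p):.1%}")
    ```

    Under those assumptions, even a task that most individuals fail (20% per person) clears the 2-of-9 bar more than half the time, so "60% average per person" and "every accepted task was human-solvable" are not in tension.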

    • No, I think I saw the graphs on someone's channel, but maybe I misinterpreted the results. To be fair, though, my point never depended on 100% of the participants getting 100% of the questions right; there are innumerable factors that could affect your performance on those tests, including the pressure. The AI also benefited from lenient conventions, so it should be "fair" in this sense.

      Either way, there's something fishy about this presentation. It says: "ARC-AGI-1 WAS EASILY BRUTE-FORCIBLE", but when o3 initially "solved" most of it, the co-founder of ARC Prize said: "Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." He was confidently saying it was not the result of brute-forcing the problems. And it was not the first time: "ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training."

      Now they are saying ARC-AGI-2 is not brute-forcible; what is happening there? They didn't provide any reasoning for why one was brute-forcible and the other is not, nor for how they are so sure about that. They "recognized" before that it could be brute-forced, but in a far less emphatic manner, explicitly stating it would need "unlimited resources and time" to solve. And now they are using non-brute-forceability in this presentation as a selling point.

      --- Also, I mentioned mammals because those problems are of an order that mammals and even other animals would need to solve in reality in a diversity of cases. I'm not saying that they would literally be able to take the test and solve it, nor that they would understand it is a test, but that they would need to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily discarded as you tried to do.

  • lol, 100% approval rates? No, they don’t.

    Also, mammals? What mammals could even understand we were giving them a test?

    Have you seen them or shown them to average people? I’m sure the people who write them understand them, but if you show these problems to average people in the street, they are completely clueless.

    This is a classic case of some PhD AI guys making a benchmark and not really considering what average people are capable of.

    Look, these insanely capable AI systems can’t do these problems, but the boys in the lab can do them. What a good benchmark.

    • Quoting my own previous response: > Also, I mentioned mammals because those problems are of an order that mammals and even other animals would need to solve in reality in a diversity of cases. I'm not saying that they would literally be able to take the test and solve it, nor that they would understand it is a test, but that they would need to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily discarded as you tried to do.

      ---

      > Have you seen them or shown them to average people? I’m sure the people who write them understand them, but if you show these problems to average people in the street, they are completely clueless.

      I can show them to people in my family; I'll do it today and come back with the answer. It's the best way of testing that out.

ARC-AGI is the closest any widely used benchmark comes to an IQ test; it's straight logic/reasoning. Looking at the problem set, it's hard for me to choose a better benchmark for "when this is better than humans, we have AGI".
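
For context, each task is a few input→output grid pairs plus a test input, and the solver has to induce the transformation from those examples alone. A toy sketch of that setup (the mini-task and the three-rule search are invented for illustration, not actual ARC data or any official solver):

```python
import json

# A made-up ARC-style task: learn the rule from the train pairs,
# then apply it to the test input. Real tasks are far more varied.
task = json.loads("""
{
  "train": [
    {"input": [[1, 2], [3, 4]], "output": [[1, 3], [2, 4]]},
    {"input": [[5, 6], [7, 8]], "output": [[5, 7], [6, 8]]}
  ],
  "test": [{"input": [[2, 9], [4, 1]]}]
}
""")

# A human solver searches a vastly richer space of abstractions;
# this sketch only knows three candidate rules.
rules = {
    "identity":  lambda g: g,
    "transpose": lambda g: [list(row) for row in zip(*g)],
    "flip_h":    lambda g: [row[::-1] for row in g],
}

def solve(task):
    # Pick the first rule consistent with every training pair.
    for name, rule in rules.items():
        if all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return name, [rule(t["input"]) for t in task["test"]]
    return None, None

print(solve(task))  # ('transpose', [[[2, 4], [9, 1]]])
```

The point the benchmark is probing is that real tasks draw on an open-ended space of rules, so nothing like this fixed enumeration works; the solver has to form the abstraction on the fly.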

  • There are humans who cannot do ARC-AGI, though, so how does an LLM not doing it mean that LLMs don’t have general intelligence?

    LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

    But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

    That must mean most humans on this planet aren’t generally intelligent either.

    • > LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.

      I don't think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room

      > But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?

      Yes. Being intelligent is about recognizing patterns, and that's what ARC-AGI tests. It tests the ability to learn. A lot of people are not very smart.
