
Comment by lukebechtel

9 days ago

Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)

Wow.

https://blog.google/innovation-and-ai/models-and-research/ge...

Even before this, Gemini 3 has always felt unbelievably 'general' to me. It can beat Balatro (ante 8) with a text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

1. It's an LLM, not something trained to play Balatro specifically

2. Most (probably >99.9%) players can't do that at the first attempt

3. I don't think there are many people who posted their Balatro playthroughs in text form online

I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.

[0]: https://balatrobench.com/

  • Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty, on the deck aimed at new players; round 24 is ante 8's final round. Also per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. And Gemini isn't even emitting legal moves 100% of the time.

    • It beats ante 8 in 9 out of 15 attempts. I do consider a 60% win rate very good for a first-time player.

      The average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell Invisible Joker (a valid move)[0]. That being said, Gemini made a big mistake in round 6 that would have cost it the run at a higher difficulty.

      [0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated.

      4 replies →

  • Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).

    • Thank you for the site! I've got a few suggestions:

      1. I think win rate is more telling than the average round number (rough sketch of what I mean below).

      2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.

      3. Instead of giving them "strategy" like "flush is the easiest hand...", it's fairer to just clarify mechanics that confuse human players too, e.g. "played" vs "scored".

      In particular, I think this kind of prompt gives the LLM an unfair advantage and can skew the results:

      > ### Antes 1-3: Foundation

      > - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker
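
      Rough sketch of points 1 and 2, assuming each run is recorded with seed / won / rounds fields (the field names are mine, not BalatroBench's actual schema):

        BUGGED_SEEDS = {"EEEEEE"}  # selling Invisible Joker bugs out on this seed

        def summarize(runs):
            # Drop runs on known-bugged seeds, then report win rate alongside average rounds.
            valid = [r for r in runs if r["seed"] not in BUGGED_SEEDS]
            wins = sum(1 for r in valid if r["won"])
            return {
                "win_rate": wins / len(valid),
                "avg_rounds": sum(r["rounds"] for r in valid) / len(valid),
                "excluded_runs": len(runs) - len(valid),
            }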

      1 reply →

  • My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seem stronger at math and science to me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or ChatGPT.

    Comparisons generally seem to change much faster than I can keep my mental model updated. But Gemini’s performance lead on more ‘academic’ explorations of science, math, engineering, etc. has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.

    I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.

    They all still suck at writing an actually good essay, article, literary or research review, or other long-form work that requires a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance, and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform them at a high level either. I myself am only successful some percentage of the time.

    • > their Deep Research usually works the hardest

      That's sort of damning with faint praise, I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening), so I kicked off a deep research for all the different countries. That was fine-ish, but it tended to go off the rails towards the end.

      So, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.

      It linked to reddit for an employment law question. When I read the reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.

      Like, I really want this to work, as it would be a massive time-saver, but I reckon that right now it only saves time if you don't want to check the sources, because they're garbage. And Google makes a business of searching the web, so it's hard for me to understand why this doesn't work better.

      I'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well.

      2 replies →

  • Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is equivalent or better at tasks that depend on post-training, occasionally even beating Pro (e.g. in Apex bench from Mercor, which is basically a tool-calling test; simplifying a bit, Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deep Think is sort of parallel test-time compute with some level of distillation and refinement from certain trajectories (guessing based on my usage and understanding), same as GPT-5.2 Pro, and it can extract more because of the pretraining datasets.

    (I am sort of basing this on papers like the limits-of-RLVR one, and on the pass@k vs pass@1 differences in RL post-training of models; this score just shows how "skilled" the base model was, or how strong its priors were. I apologize if this is not super clear, happy to expand on what I am thinking.)
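
    For context, the standard way to estimate pass@k is the unbiased estimator from the Codex paper; a minimal sketch with hypothetical numbers:

      from math import comb

      def pass_at_k(n, c, k):
          # Unbiased pass@k estimator: n samples drawn per task,
          # c of them correct, k <= n.
          if n - c < k:
              return 1.0
          return 1.0 - comb(n - c, k) / comb(n, k)

      # e.g. 5 correct out of 100 samples: pass@1 = 0.05, but pass@64 is ~0.99
      print(pass_at_k(100, 5, 1), pass_at_k(100, 5, 64))

    The point being: when pass@k at large k is far above pass@1, the solutions are already in reach of the base model, and post-training mostly changes which sample comes out first.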

  • > I don't think there are many people who posted their Balatro playthroughs in text form online

    There is *tons* of Balatro content on YouTube though, and there is absolutely no doubt that Google is using YouTube content to train its models.

  • It's trained on YouTube data. It's going to get roffle and drspectred at the very least.

  • Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

    Nonetheless I still think it's impressive that we have LLMs that can just do this now.

    • Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

  • I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data.

  • Yes, for agentic work Claude Opus is best, and for complex coding it's GPT-5.x. But for smartness, I always felt Gemini 3 Pro is best.

    • Can you give an example of smartness where Gemini is better than the other two? I have found Gemini 3 Pro the opposite of smart on the tasks I gave it (evaluation, extraction, copywriting, judging, synthesising), with GPT-5.2 xhigh first and Opus 4.5/4.6 second. Not to mention it likes to hallucinate quite a bit.

      1 reply →

  • Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table; Claude got it on the first try.

    • Claude is king for agentic workflows right now because it’s amazing at tool calling and following instructions well (among other things)

      2 replies →

  • But... there's Deepseek v3.2 in your link (rank 7)

    • Grok (rank 6) and below didn't beat the game even once.

      Edit: I worded my original comment badly. I meant to say Deepseek can't beat Balatro at all, not that it can't play. Sorry.

  • > Most (probably >99.9%) players can't do that at the first attempt

    Eh, both my partner and I did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.

Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like?

I ask because I can't tell all the benchmarks apart from memory.

  • François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.

    His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

    • > His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

      That is the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

      That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask for nearly as much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.

      Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

      67 replies →

    • Please let’s hold M Chollet to account, at least a little. He launched ARC claiming transformer architectures could never do it and that he thought solving it would be AGI. And he was smug about it.

      ARC 2 had a very similar launch.

      Both have been crushed in far less time than he predicted, without significantly different architectures.

      It’s a hard test! And novel, and worth continuing to iterate on. But it was not launched with the humility your last sentence describes.

      4 replies →

    • Hello Gemini, please fix:

      Biological Aging: Find the cellular "reset switch" so humans can live indefinitely in peak physical health.

      Global Hunger: Engineer a food system where nutritious meals are a universal right and never a scarcity.

      Cancer: Develop a precision "search and destroy" therapy that eliminates every malignant cell without side effects.

      War: Solve the systemic triggers of conflict to transition humanity into an era of permanent global peace.

      Chronic Pain: Map the nervous system to shut off persistent physical suffering for every person on Earth.

      Infectious Disease: Create a universal shield that detects and neutralizes any pathogen before it can spread.

      Clean Energy: Perfect nuclear fusion to provide the world with limitless, carbon-free power forever.

      Mental Health: Unlock the brain's biology to fully cure depression, anxiety, and all neurological disorders.

      Clean Water: Scale low-energy desalination so that safe, fresh water is available in every corner of the globe.

      Ecological Collapse: Restore the Earth’s biodiversity and stabilize the climate to ensure a thriving, permanent biosphere.

    • ARC-AGI-3 uses dynamic games whose rules the LLMs must figure out, and it is MUCH harder. LLMs can also be ranked on how many steps they required.

    • I don't think the creator believes ARC3 can't be solved, but rather that it can't be solved "efficiently", and >$13 per task for ARC2 is certainly not efficient.

      But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.

  • Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing', i.e. improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step-function increase in intelligence for the Gemini line of models).

I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".

I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.

Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.

  • Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators, and the fact that the token generators are somehow beating it anyway really says something.

  • The average ARC AGI 2 score for a single human is around 60%.

    "100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."

    https://arcprize.org/arc-agi/2/

    • Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher.

      2 replies →

    • What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.

      None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.

      13 replies →

  • Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand, or to a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do the specialized work. What is missing?
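
    As a rough sketch of the first idea, a "structure the space" tool for an ARC-style color grid could extract same-colored connected components with sizes and bounding boxes, so the model reasons over objects instead of raw cells (hypothetical helper, not anything the ARC harnesses actually provide):

      from collections import deque

      def grid_to_objects(grid):
          # grid: list of lists of ints; 0 is treated as background (an assumption).
          h, w = len(grid), len(grid[0])
          seen = [[False] * w for _ in range(h)]
          objects = []
          for r in range(h):
              for c in range(w):
                  if seen[r][c] or grid[r][c] == 0:
                      continue
                  color, cells = grid[r][c], []
                  queue = deque([(r, c)])
                  seen[r][c] = True
                  while queue:  # BFS flood fill over 4-connected same-color cells
                      y, x = queue.popleft()
                      cells.append((y, x))
                      for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                          ny, nx = y + dy, x + dx
                          if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and grid[ny][nx] == color:
                              seen[ny][nx] = True
                              queue.append((ny, nx))
                  ys, xs = [y for y, _ in cells], [x for _, x in cells]
                  objects.append({"color": color, "size": len(cells),
                                  "bbox": (min(ys), min(xs), max(ys), max(xs))})
          return objects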

https://arcprize.org/leaderboard

$13.62 per task, so do we need another 5-10 years for the price of running this to become reasonable?

But the real question is whether they just fit the model to the benchmark.

Well, a fair comparison would be with GPT-5.x Pro, which is the same class of model as Gemini Deep Think.

Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.

It's completely misnamed. It should be called useless visual puzzle benchmark 2.

Firstly, it's a visual puzzle, which makes it way easier for humans than for models trained primarily on text. Secondly, it's not really that obvious or easy for humans to solve themselves!

So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that basically means nothing, other than that the models can now solve "Arc-AGI".

  • The puzzles are calibrated for human solve rates, but otherwise I agree.

    • My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.

      I would say they do have "general intelligence", so whatever Arc-AGI is "solving", it's definitely not "AGI".

      5 replies →

Am I the only one who can’t find Gemini useful except when you want something cheap? I don’t get what the whole code red or all that PR was about. I see no reason to use Gemini instead of a GPT and Anthropic combo. I should add that I’ve tried it as a chatbot, for coding through Copilot, and as part of multi-model prompt generation.

Gemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all.

  • You are not the only one. It's to the point where I think these benchmark results must be faked somehow, because they don't match my reality at all.

  • I find the quality is not consistent at all, and of all the LLMs I use, Gemini is the one most likely to just veer off and ignore my instructions.

    • Same, as far as I am concerned, Gemini is optimized for benchmarks.

      I mean, last week it suddenly insisted on two consecutive prompts that my code was in Python. It was in Rust.

  • Maybe it depends on the usage, but in my experience, most of the time Gemini produces much better results for coding, especially the optimization parts. The results produced by Claude weren't even near those of Gemini. But again, it depends on the task, I think.

We can really look at it both ways. It is actually concerning that a model that won the IMO last summer still fails 15% of ARC-AGI-2.

I’m surprised that Gemini 3 Pro is so low at 31.1%, though, compared to Opus 4.6 and GPT-5.2. This is a great achievement, but it’s only available to Ultra subscribers, unfortunately.

At $13.62 per task it's practically unusable for agent tasks due to the cost.

I found that anything over $2/task on Arc-AGI-2 ends up being way too much for use in coding agents.

I read somewhere that Google will ultimately always produce the best LLMs, since "good AI" relies on massive amounts of data and Google owns the most data.

Is that a sound assumption?

I mean, remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced this isn’t data leakage.