
Comment by raincole

6 days ago

Even before this, Gemini 3 has always felt unbelievably 'general' to me. It can beat Balatro (ante 8) with a text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

1. It's an LLM, not something trained to play Balatro specifically

2. Most (probably >99.9%) players can't do that at the first attempt

3. I don't think there are many people who posted their Balatro playthroughs in text form online

I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.

[0]: https://balatrobench.com/

Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty, on the deck aimed at new players. Round 24 is ante 8's final round. That figure also includes giving the LLM a strategy guide, which first-time players do not have, and Gemini isn't even emitting legal moves 100% of the time.

  • It beats ante eight in 9 out of 15 attempts. I consider a 60% win rate very good for a first-time player.

    The average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell an Invisible Joker (a valid move)[0]. That said, Gemini made a big mistake in round 6 that would have cost it the run at a higher difficulty.

    [0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated.

Hi, BalatroBench creator here. Yeah, Google models perform well (I'd guess it's the long-context + world-knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).

  • Thank you for the site! I've got a few suggestions:

    1. I think win rate is more telling than the average round number.

    2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.

    3. Instead of giving them "strategy" like "flush is the easiest hand...", it's fairer to clarify some mechanics that confuse human players too, e.g. "played" vs. "scored".

    In particular, I think this kind of prompt gives the LLM an unfair advantage and can skew the results:

    > ### Antes 1-3: Foundation

    > - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker

    • I'm pretty open to feedback and contributions (also regarding the default strategy), so feel free to open issues on GH. However, I'd like to collect a bunch of them (including bugs) before re-running the whole benchmark (balatrobench v2).

  • Did you consider doing it as a computer use task? I probably find those more compelling.

    It's what I did for my game benchmark https://d.erenrich.net/paperclip-bench/index.html

    • Not really. I downloaded Balatro, saw that it was moddable, and wrote a mod API to interact with the game programmatically. I was just curious whether, from a text-only game state representation, an LLM would be able to make some decent plays. The benchmark was a late pivot.
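
      Very roughly, the loop is: dump the game state as text, ask the model for a move, and only apply it if it's legal. A minimal sketch of that shape, with every name made up for illustration (this is not the actual mod API or the benchmark's prompts):

      ```python
      # Illustrative sketch only: not the real BalatroBench mod API or prompts.
      import json

      def snapshot_to_prompt(state: dict) -> str:
          """Turn a text/JSON snapshot of the game state into an instruction for the model."""
          return (
              "You are playing Balatro. Current state:\n"
              + json.dumps(state, indent=2)
              + "\nReply with exactly one legal action, e.g. PLAY 1 3 5 or DISCARD 2 4."
          )

      def accept_move(llm_reply: str, legal_actions: set[str]) -> str | None:
          """Keep the model's move only if it is legal; otherwise the caller re-prompts."""
          move = llm_reply.strip().upper()
          return move if move in legal_actions else None
      ```

      The point is that everything the model sees is text, no screenshots or computer use involved.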

My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seem stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or ChatGPT.

Comparisons generally seem to change much faster than I can keep my mental model updated. But Gemini’s performance lead on more ‘academic’ explorations of science, math, engineering, etc. has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.

I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.

They all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor: there’s just so much nuance, and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks at a high level either. I myself am only successful some percentage of the time.

  • > their Deep Research usually works the hardest

    That's sort of damning with faint praise, I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening), so I kicked off a deep research for all the different countries. That was fine-ish, but it tended to go off the rails towards the end.

    So, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.

    It linked to reddit for an employment law question. When I read the reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.

    Like, I really want this to work, as it would be a massive time-saver, but I reckon that right now it only saves time if you don't want to check the sources, because they are garbage. And Google makes a business of searching the web, so it's hard for me to understand why this doesn't work better.

    I'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well.

    • Oh yeah, LLMs currently spew a lot of garbage. Everything has to be double-checked. I mainly use them for gathering sources and pointing out a few considerations I might have otherwise overlooked. I often run them a few times, because they go off the rails in different directions, but sometimes those directions are helpful for me in expanding my understanding.

      I still have to synthesize everything from scratch myself. Every report I get back is like "okay well 90% of this has to be thrown out" and some of them elicit a "but I'm glad I got this 10%" from me.

      For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.

      Also, Google changed their business from Search to Advertising. Kagi does a much better job for me these days, and is easily worth the $5/mo I pay.


Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is as good or equivalent at tasks that require post-training, occasionally even beating Pro (e.g., simplifying a bit, the APEX bench from Mercor is basically a tool-calling test, and Flash beats Pro on it). The score on ARC-AGI-2 is another data point in the same direction. Deep Think is sort of parallel test-time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding), same as GPT-5.2 Pro, and it can extract more because of the pretraining datasets.

(I'm loosely basing this on papers like the limits-of-RLVR work, and on the pass@k vs. pass@1 differences in RL post-training of models; a score like this mostly shows how "skilled" the base model was, or how strong its priors were. I apologize if this is not super clear, happy to expand on what I am thinking.)
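
For anyone unfamiliar, pass@k here refers to the usual unbiased estimator from the HumanEval/Codex paper. A minimal sketch (my own, not any lab's eval code), reusing the 9-wins-out-of-15-runs number from upthread purely as an example:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws is correct, given c correct solutions out of n samples."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(15, 9, 1))  # ~0.60
print(pass_at_k(15, 9, 5))  # ~0.998
```

Roughly, the point of that line of work is that RL post-training mostly closes the pass@1 vs. pass@k gap rather than pushing past what the base model could already reach at large k, which is why strong pretraining would matter so much here.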

> I don't think there are many people who posted their Balatro playthroughs in text form online

There is *tons* of Balatro content on YouTube though, and there is absolutely no doubt that Google uses YouTube content to train their models.

It's trained on YouTube data. It's going to get roffle and drspectred at the very least.

Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

  • Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

  • If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

    • I think I weakly disagree. Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.


I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data.

Yes, agentic-wise, Claude Opus is best. For complex coding, GPT-5.x. But for smartness, I've always felt Gemini 3 Pro is best.

  • Can you give an example of smartness where Gemini is better than the other two? I have found Gemini 3 Pro the opposite of smart on the tasks I gave it (evaluation, extraction, copywriting, judging, synthesising), with GPT-5.2 xhigh first and Opus 4.5/4.6 second. Not to mention it likes to hallucinate quite a bit.

    • I use it for classic engineering a lot; it beats out ChatGPT and Opus (though I haven't tried Opus as much as ChatGPT). Flash is also way stronger than it should be.

Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table; Claude got it on the first try.

  • Claude is king for agentic workflows right now because it’s amazing at tool calling and following instructions well (among other things)

    • I've asked Gemini to not use phrases like "final boss" and to not generate summary tables unless asked to do so, yet it always ignores my instructions.

But... there's Deepseek v3.2 in your link (rank 7)

  • Grok (rank 6) and below didn't beat the game even once.

    Edit: I phrased it wrong in my original comment. I meant to say Deepseek can't beat Balatro at all, not that it can't play. Sorry.

> Most (probably >99.9%) players can't do that at the first attempt

Eh, both my partner and I did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.