
Comment by S1M0N38-hn

6 days ago

Hi, BalatroBench creator here. Yeah, Google models perform well (I guess thanks to their long-context + world-knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. Red Stake).

Thank you for the site! I've got a few suggestions:

1. I think win rate is more telling than the average round reached (rough sketch below).

2. Some runs are bugged (like Gemini's run 9) and should be excluded from the results. Selling the Invisible Joker is always bugged, rendering all runs with the seed EEEEEE invalid.

3. Instead of giving them a "strategy" like "flush is the easiest hand...", it's fairer to clarify mechanics that confuse human players too, e.g. "played" vs. "scored" cards.

In particular, I think this kind of prompt gives the LLM an unfair advantage and can skew the results:

> ### Antes 1-3: Foundation

> - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker
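
To make suggestions 1 and 2 concrete, here's a rough sketch of the aggregation I have in mind (the field names `won`, `seed`, and `final_round` are placeholders, not the actual BalatroBench result format):

```python
# Placeholder run records; not the real BalatroBench schema.
runs = [
    {"model": "gemini-3-pro", "seed": "AAAAAA", "won": True,  "final_round": 24},
    {"model": "gemini-3-pro", "seed": "EEEEEE", "won": True,  "final_round": 24},
    {"model": "gemini-3-pro", "seed": "BBBBBB", "won": False, "final_round": 17},
]

# Suggestion 2: drop runs on the seed where selling the Invisible Joker is bugged.
valid_runs = [r for r in runs if r["seed"] != "EEEEEE"]

# Suggestion 1: report win rate rather than (or alongside) average round reached.
win_rate = sum(r["won"] for r in valid_runs) / len(valid_runs)
avg_round = sum(r["final_round"] for r in valid_runs) / len(valid_runs)
print(f"win rate: {win_rate:.0%}, avg round: {avg_round:.1f}")
```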

  • I'm pretty open to feedback and contributions (also regarding the default strategy), so feel free to open issues on GH. However, I'd like to collect a bunch of them (including bugs) before re-running the whole benchmark (balatrobench v2).

Did you consider doing it as a computer-use task? I find those more compelling, personally.

That's what I did for my game benchmark: https://d.erenrich.net/paperclip-bench/index.html

  • Not really. I downloaded Balatro, saw that it was moddable, and wrote a mod API to interact with it programmatically. I was just curious whether, from a text-only game state representation, an LLM would be able to make some decent plays. The benchmark was a late pivot.
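
Roughly, the play loop looks like this (a simplified sketch; `mod_api`, `llm.complete`, and the action format are illustrative placeholders, not the real BalatroBench interfaces):

```python
import json

def play_one_run(mod_api, llm, max_steps=500):
    """Drive a single Balatro run from a text-only game state (placeholder interfaces)."""
    for _ in range(max_steps):
        state = mod_api.get_game_state()  # hand, jokers, blind, money, ... as a dict
        if state.get("game_over"):
            return state
        prompt = (
            "You are playing Balatro. Current state:\n"
            + json.dumps(state, indent=2)
            + '\nReply with a single action as JSON, e.g. '
              '{"action": "play_hand", "cards": [0, 2, 3]}'
        )
        action = json.loads(llm.complete(prompt))  # model picks the next move
        mod_api.apply(action)                      # the mod executes it in-game
    return mod_api.get_game_state()
```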