Comment by cesarvarela

7 days ago

LLMs have a large quantity of chess data and still can't play for shit.

Not anymore. This benchmark is for LLM chess ability: https://github.com/lightnesscaster/Chess-LLM-Benchmark?tab=r.... LLMs are graded according to FIDE rules so e.g. two illegal moves in a game leads to an immediate loss.

This benchmark doesn't have the latest models from the last two months, but Gemini 3 (with no tools) is already at 1750 - 1800 FIDE, which is approximately probably around 1900 - 2000 USCF (about USCF expert level). This is enough to beat almost everyone at your local chess club.

  • Wait, I may be missing something here. These benchmarks are gathered by having models play each other, and the second illegal move forfeits the game. This seems like a flawed method as the models who are more prone to illegal moves are going to bump the ratings of the models who are less likely.

    Additionally, how do we know the model isn’t benchmaxxed to eliminate illegal moves.

    For example, here is the list of games by Gemini-3-pro-preview. In 44 games it preformed 3 illegal moves (if I counted correctly) but won 5 because opponent forfeits due to illegal moves.

    https://chessbenchllm.onrender.com/games?page=5&model=gemini...

    I suspect the ratings here may be significantly inflated due to a flaw in the methodology.

    EDIT: I want to suggest a better methodology here (I am not gonna do it; I really really really don’t care about this technology). Have the LLMs play rated engines and rated humans, the first illegal move forfeits the game (same rules apply to humans).

    • The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.

      The rest is taken care of by elo. That is they then play each other as well, but it is not really possible for Gemini to have a higher elo than maia with such a small sample size (and such weak other LLMs).

      Elo doesn't let you inflate your score by playing low ranked opponents if there are known baselines (rated engines) because the rated engines will promptly crush your elo.

      You could add humans into the mix, the benchmark just gets expensive.

      6 replies →

    • That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

      4 replies →

  • Yeah, but 1800 FIDE players don't make illegal moves, and Gemini does.

    • 1800 FIDE players do make illegal moves. I believe they make about one to two orders of magnitude less illegal moves than Gemini 3 does here. IIRC the usual statistic for expert chess play is about 0.02% of expert chess games have an illegal move (I can look that up later if there's interest to be sure), but that is only the ones that made it into the final game notation (and weren't e.g. corrected at the board by an opponent or arbiter). So that should be a lower bound (hence why it could be up to one order lower, although I suspect two orders is still probably closer to the truth).

      Whether or not we'll see LLMs continue to get a lower error rate to make up for those orders of magnitude remains to be seen (I could see it go either way in the next two years based on the current rate of progress).

      3 replies →

  • Why do we care about this? Chess AI have long been solved problems and LLMs are just an overly brute forced approach. They will never become very efficient chess players.

    The correct solution is to have a conventional chess AI as a tool and use the LLM as a front end for humanized output. A software engineer who proposes just doing it all via raw LLM should be fired.

Hm.. but do they need it.. at this point, we do have custom tools that beat humans. In a sense, all LLM need is a way to connect to that tool ( and the same is true is for counting and many other aspects ).

  • Yeah, but you know that manually telling the LLM to operate other custom tools is not going to be a long-term solution. And if an LLM could design, create, and operate a separate model, and then return/translate its results to you, that would be huge, but it also seems far away.

    But I'm ignorant here. Can anyone with a better background of SOTA ML tell me if this is being pursued, and if so, how far away it is? (And if not, what are the arguments against it, or what other approaches might deliver similar capacities?)

    • This has been happening for the past year on verifiable problems (did the change you made in your codebase work end-to-end, does this mathematical expression validate, did I win this chess match, etc...). The bulk of data, RL environment, and inference spend right now is on coding agents (or broadly speaking, tool use agents that can make their own tools).

      Recent advances in mathematical/physics research have all been with coding agents making their own "tools" by writing programs: https://openai.com/index/new-result-theoretical-physics/