I feel like the article neglects one obvious possibility: that OpenAI decided that chess was a benchmark worth "winning", special-cases chess within gpt-3.5-turbo-instruct, and then neglected to add that special-case to follow-up models since it wasn't generating sustained press coverage.
I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.
That's something completely different than what the OP suggests and would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).
TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For example, with a complicated math equation, using a calculator is just smarter than doing it in your head.
This seems quite likely to me, but did they special case it by reinforcement training it into the LLM (which would be extremely interesting in how they did it and what its internal representation looks like) or is it just that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?
Why couldn't they add a tool that literally calls Stockfish or a chess AI behind the scenes with function calling, and buffer the request before sending it back to the endpoint output interface?
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat ai, pass it through a moderation filter, and you might get good output from it with very little latency added.
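Something like this purely hypothetical sketch, stitched together from the public function-calling API plus python-chess and a local Stockfish binary (none of this is known to be what OpenAI actually runs behind its endpoints):

```python
import json
import chess
import chess.engine
from openai import OpenAI

client = OpenAI()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")   # assumes a local Stockfish binary on PATH

tools = [{
    "type": "function",
    "function": {
        "name": "best_chess_move",
        "description": "Return a strong move (UCI) for the given FEN position.",
        "parameters": {
            "type": "object",
            "properties": {"fen": {"type": "string"}},
            "required": ["fen"],
        },
    },
}]

messages = [{"role": "user",
             "content": "You are white. FEN: " + chess.STARTING_FEN + " What do you play?"}]
first = client.chat.completions.create(model="gpt-4o-mini",  # placeholder model
                                       messages=messages, tools=tools)
msg = first.choices[0].message

if msg.tool_calls:                                           # model decided to call the "engine" tool
    call = msg.tool_calls[0]
    fen = json.loads(call.function.arguments)["fen"]
    move = engine.play(chess.Board(fen), chess.engine.Limit(time=0.1)).move.uci()
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": move}]
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)                  # the engine's move, phrased by the LLM

engine.quit()
```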
Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.
That's the problem with closed models, we can never know what they're doing.
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
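The trailing-space issue is easy to see with any BPE vocabulary. Here's an illustration using tiktoken's cl100k_base (the open models have their own tokenizers, but the effect is the same in kind): the trailing space either becomes its own token or merges differently, whereas in training data the space is usually glued onto the start of the next move.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for prompt in ("1. e4 e5 2.", "1. e4 e5 2. "):
    ids = enc.encode(prompt)
    # Print the token pieces so the difference at the end of the prompt is visible.
    print(repr(prompt), "->", [enc.decode([i]) for i in ids])
```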
Ah, buried in the post-article part. I was wondering how all of the models were seemingly capable of making legal moves, since last I saw something about LLMs playing Chess they were very much not capable of that.
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radically different tokenization.
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.
E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
I’m the one who will fight you including with peer reviewed papers indicating that it is in fact due to tokenization. I’m too tired but will edit this for later, so take this as my bookmark to remind me to respond.
At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". You know how many errors an LLM gets in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc?
My assumption is that these large companies wouldn't pay hundreds of thousands of RLHF'ers through dozens of third party companies livable wages if tokenization errors were just that.
I think it's infeasible to train on bytes unfortunately, but yeah it also seems very wrong to use a handwritten and ultimately human version of tokens (if you take a look at the tokenizers out there you'll find fun things like regular expressions to change what is tokenized based on anecdotal evidence).
I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real issue here is that language is not a good way to encode all forms of knowledge.
Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.
So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.
You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
A byte is itself sort of a token. So is a bit. It makes more sense to use more tokenizers in parallel than it does to try and invent an entirely new way of seeing the world.
Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.
I would say that "humans have to tokenize" is almost precisely the opposite of how human intelligence works.
We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
This is probably unnecessary, but: I wish you wouldn't use the word "stupid" there. Even if you didn't mean anything by it personally, it might reinforce in an insecure reader the idea that, if one can't speak intelligently about some complex and abstruse subject that other people know about, there's something wrong with them, like they're "stupid" in some essential way. When in fact they would just be "ignorant" (of this particular subject). To be able to formulate those questions at all is clearly indicative of great intelligence.
hot take: LLM tokens are kanji for AI, and just like kanji it works okay sometimes but fails miserably for the task of accurately representing English
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.
But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
I have seen a bunch of tokenization papers with various ideas but their results are mostly meh. I personally don't see anything principally wrong with current approaches. Having discrete symbols is how natural language works, and this might be an okayish approximation.
It's probably worth playing around with different prompts and different board positions.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment, for example giving instructions to the non-instruction tuned variants may be counter productive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that better predicts what is most likely to happen here.
Apparently I can find some matches for games that start like that between very strong players [1], so my hypothesis that the model may just be predicting bad moves on purpose seems wobbly, although having stockfish at the lowest level play as the supposedly very strong opponent may still be throwing the model off somewhat. In the charts the first few moves the model makes seem decent, if I'm interpreting these charts right, and after a few of those things seem to start going wrong.
Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, stockfish strength, starting position, the name of the supposed players, etc.).
Interesting thought: the LLM isn’t trying to win, it’s trying to produce data like the input data. It’s quite rare for a very strong player to play a very weak one. If you feed it lots of weak moves it’ll best replicate the training data by following with weak moves.
The experiment started from the first move of a game, and played each game fully. The position you linked was just an example of the format used to feed the game state to the model for each move.
What would "winning" or "losing" even mean if all of this was against a single move?
Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
> he discusses using a grammar to restrict to only legal moves
Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, and it's white's turn, and there isn't a white piece on a5, and a5 isn't threatened by black. Otherwise it isn't.
"White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?
Still an interesting direction of questioning. Maybe could be rephrased as "how much work is the grammar doing"? Are the results with the grammar very different than without? If/when a grammar is not used (like in the openai case), how many illegal moves does it try on average before finding a legal one?
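If you want to quantify that, the genuinely legal moves for a position are easy to enumerate with python-chess, which is what a per-position grammar would have to encode; a position-independent grammar can only catch syntax errors like "a9". A minimal sketch:

```python
import chess

board = chess.Board()
board.push_san("e4")
board.push_san("e5")

# The full legal-move set after 1. e4 e5 -- the thing a per-position grammar would encode.
print(sorted(board.san(m) for m in board.legal_moves))

# Syntactically valid SAN, but illegal in this position; python-chess rejects it.
try:
    board.parse_san("Ka5")
except ValueError as err:
    print("rejected:", err)
```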
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
There are some who suggest that modern chess is mostly a game of memorization and not one particularly of strategy or skill. I assume this is why variants like speed chess exist.
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.
"playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
But to some approximation we do know how an LLM plays chess.. based on all the games, sites, blogs, analysis in its training data. But it has a limited ability to tell a good move from a bad move since the training data has both, and some of it lacks context on move quality.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
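A hedged sketch of that experiment, assuming the openai SDK, python-chess with a local Stockfish binary, and (optimistically) that the model puts the move in SAN on its first line; the model name and position are just placeholders:

```python
import chess
import chess.engine
from openai import OpenAI

client = OpenAI()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")

# A quiet early-middlegame position reached from a few book moves.
board = chess.Board()
for san in ["d4", "Nf6", "c4", "e6", "Nc3", "d5", "Nf3", "Be7", "e3", "O-O", "Be2", "Nc6", "O-O"]:
    board.push_san(san)

prompt = (f"Position (FEN): {board.fen()}\n"
          "Play a new move that a creative grandmaster has discovered, never before "
          "played in chess, and explain the tactics and strategy behind it. "
          "Put the move in standard algebraic notation on the first line.")

scores = []
for _ in range(50):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    ).choices[0].message.content
    candidate = board.copy()
    try:
        candidate.push_san(reply.splitlines()[0].strip())
    except ValueError:
        continue  # unparseable or illegal suggestion
    info = engine.analyse(candidate, chess.engine.Limit(depth=18))
    scores.append(info["score"].white().score(mate_score=10000))

engine.quit()
print(sorted(scores))  # the distribution of engine evals for the "novel" moves
```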
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that includes bit hits, big misses and everything in between. Creators chip away at the edges to change that distribution but the fundamental workings don't change.
One of the main purposes of running experiments of any sort is to find out if our preconceptions are accurate. Of course, if someone is not interested in that question, they might as well choose not to look through the telescope.
This is a puzzle given enough training information. LLM can successfully print out the status of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. Decent is subjective, but that should beat at least beginners. And the lowest level of stockfish used in the blog post is lowest intermediate.
I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
Because it's a straightforward stochastic sequence modelling task and I've seen GPT-3.5-turbo-instruct play at high amateur level myself. But it seems like all the RLHF and distillation that is done on newer models destroys that ability.
PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close
"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at this current level; like previous instances of chatbots and the like, while initially they can convince some people that they're intelligent thinking machines, this test proves that they aren't. It's part of the scientific process.
They thought it because we have an existence proof: gpt-3.5-turbo-instruct can play chess at a decent level.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
I suppose you didn't get the news, but google developed a LLM that can play chess. And play it at grandmaster level: https://arxiv.org/html/2402.04494v1
Not quite an LLM. It's a transformer model, but there's no tokenizer or words, just chess board positions (64 tokens, one per board square). It's purpose-built for chess (never sees a word of text).
It's interesting to note that the paper benchmarked its chess playing performance against GPT-3.5-turbo-instruct, the only well performant LLM in the posted article.
Right, at least as of the ~GPT3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that, it would need to use the context to make it so white doesn't make critical mistakes.
Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play just decent amateur level.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
The blog post demonstrates that an LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stress your prior unfounded belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation of people not being educated when clearly you are the one that's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes. There’s nothing thought provoking about a bunch of engineers doing “experiments” against a service, other than how sad it is to debase themselves in this way.
But there's really nothing about chess that makes reasoning a prerequisite, a win is a win as long as it's a win. This is kind of a semantics game: it's a question of whether the degree of skill people observe in an LLM playing chess is actually some different quantity than the chance it wins.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should always expect the computation you're able to do via conscious reasoning alone to be sufficient, at least in principle, to asymptotically get a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.
> It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
It sorta played chess: he let it generate up to ten moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try he picked a random legal move. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.
It's just a lossy compression of all of the parameters, probably not important, right?
i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.
while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.
> i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.
Sure, but so does the number of paragraphs in the English language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.
(And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)
Since we're mentioning Shannon... What is the minimum representative sample size of that problem space? Is it close enough to the number of freely available chess moves on the Internet and in books?
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.
Yeah, once you've deviated from a sequence you're lost.
Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
I’d bet it’s using function calling out to a real chess engine. It could probably be proven with a timing analysis to see how inference time changes/doesn’t with number of tokens or game complexity.
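A rough sketch of what that timing analysis could look like; it assumes the openai SDK and the completions endpoint that gpt-3.5-turbo-instruct uses, and you'd need many repetitions to separate any signal from network and batching noise:

```python
import time
from openai import OpenAI

client = OpenAI()

# Illustrative PGN prefixes: one trivial, one deep into a well-known line.
prompts = {
    "short opening": "1. e4 e5 2. Nf3 Nc6 3.",
    "deep midgame": ("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 "
                     "7. Bb3 d6 8. c3 O-O 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12."),
}

for label, prompt in prompts.items():
    t0 = time.perf_counter()
    resp = client.completions.create(model="gpt-3.5-turbo-instruct",
                                     prompt=prompt, max_tokens=8, temperature=0)
    elapsed = time.perf_counter() - t0
    # If a hidden engine were consulted, latency should track position complexity
    # rather than prompt/output token counts.
    print(f"{label}: {elapsed:.2f}s -> {resp.choices[0].text!r}")
```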
?? why would openai even want to secretly embed chess function calling into an incredibly old model? if they wanted to trick people into thinking their models are super good at chess why wouldn't they just do that to gpt-4o?
OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)
Playing is strongly related to an abstract representation of the game in terms of game states. Even if the player does not realize it, with chess it’s really about shallow or beam search within the possible moves.
LLMs don’t do reasoning or exploration, but they write text based on previous text. So to us it may seem like playing, but it is really smart guesswork based on previous games. It’s like Kasparov writing moves without imagining the actual placement.
What would be interesting is to see whether a model, given only the rules, will play. I bet it won’t.
At this moment it’s replaying by memory but definitely not chasing goals. There’s no such thing as forward attention yet, and beam search is expensive enough, so one would prefer to actually fall back to classic chess algos.
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting:
"1. Think about the current board
2. Think about valid possible next moves and choose the 3 best by thinking ahead
3. Make your move"
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.
One could try using DSPy for automatic prompt optimization.
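A minimal sketch of such a CoT prompt; the model name and exact wording are placeholders, and the final-line convention is just one way to make the answer parseable:

```python
from openai import OpenAI

client = OpenAI()
pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"

cot_prompt = f"""You are playing black. Game so far: {pgn_so_far}

1. Think about the current board: describe where every piece stands.
2. Think about valid possible next moves and choose the 3 best by thinking ahead.
3. Make your move. End with a single line of the form "Move: <algebraic notation>".
"""

reply = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.7,
).choices[0].message.content

move = reply.rsplit("Move:", 1)[-1].strip()  # naive extraction of the final answer
print(move)
```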
Can be forced through inference with CoT type of stuff. Spend tokens at each stage to draw the board for example, then spend tokens restating the rules of the game, then spend token restating the heuristics like piece value, and then spend tokens doing a minmax n-ply search.
Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.
Essentially user would have to teach gpt to play chess, or training would fine tune chess towards these CoT, fine tuning, etc...
Yeah, the expectation of an immediate answer definitely hurts results, especially for the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of Stockfish etc. does. Also the specific chess notation being prompted actually matters
5. You can, or well, you used to be able to inspect the logprobs. I think OpenAI has stopped doing this but the link in 4 does show the author inspecting it for Turbo instruct.
> Also the specific chess notation being prompted actually matters
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
This is likely. From example games, it not only knows the rules (which would be impressive by itself, just making the legal moves is not trivial). It also has some planning capabilities (plays combinations of several moves).
Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.
OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count until 10000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
Just think about the trade off from OpenAI's side here - they're going to add a bunch of complexity to gpt3.5 to let it call out to engines (either an external system monitoring all outputs for chess related stuff, or some kind of tool-assisted CoT for instance) just so it can play chess incorrectly a high percentage of the time, and even when it doesn't at a mere 1800ELO level? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
But there could be a simple explanation. For example, they could have tested many "engines" when developing function calling and they just left them in there. They just happened to connect to a basic chess playing algorithm and nothing sophisticated.
Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.
Could be a pilot implementation to learn about how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known, standardized and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule based engine in house), the need to know everything they can about expected cost, performance.
Sorry, this is just conspiracy theorizing. I've tried it with GPT-3.5-instruct myself in the OpenAI playground, where the model clearly does nothing but auto-regression. No function calling there whatsoever.
Occam’s razor. I could build a good chess playing wrapper around OpenAPI (any version) that would consult a chess engine when presented with any board scenario, and introduce some randomness so that it doesn’t play too well.
I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can substitute chess for any formal system that has a reliable oracle.
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
> OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
When ChatGPT3.5 first came out, people were using it to simulate entire Linux system installs, and even browsing a simulated Internet.
Cool use cases like that aren't even discussed anymore.
I still wonder what sort of magic OpenAI had and then locked up away from the world in the name of cost savings.
Same thing with GPT 4 vs 4o, 4o is obviously worse in some ways, but after the initial release (when a bunch of people mentioned this), the issue has just been collectively ignored.
We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.
Unlike people, how you ask the question really really affects the output quality.
I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!) In all of these games are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves, 20+ moves into a chess game, is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next token prediction.
I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.
I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!
The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference) or sample the model for a valid move up to ten times, then pick a random valid move.
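In code, the closed-model fallback described there amounts to something like this (a sketch; get_model_move stands in for whatever API call produces a candidate move):

```python
import random
import chess

def choose_move(board: chess.Board, get_model_move, attempts: int = 10) -> chess.Move:
    """Ask the model for a move up to `attempts` times; fall back to a random legal move."""
    for _ in range(attempts):
        candidate = get_model_move(board)       # e.g. "Nf3" returned by the LLM
        try:
            return board.parse_san(candidate)   # raises ValueError if garbled or illegal
        except ValueError:
            continue
    return random.choice(list(board.legal_moves))
```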
My money is on a fluke inclusion of more chess data in that model's training.
All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar so training data is the most likely explanation
I feel like a lot of people here are slightly misunderstanding how LLM training works. yes the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety, but for any desired feature
these companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves
"Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the models ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
Keep in mind, everyone, that stockfish on its lowest level on lichess is absolutely terrible, and a 5-year old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, does -3 blunders, totally random-looking bad moves.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
Here's a quick anonymous game against it by me, where I obliterate the poor thing in 11 moves. I was around a 1500 ELO classical strength player, which is a teeny bit above average, globally. But I mean - not an expert, or even one of the "strong" club players (in any good club).
https://lichess.org/ -- try yourself! It's really so bad, it's good fun. Click "play with computer" on the right, then level 1 is already selected, you hit go
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that tokenization causes something minuscule like a " " at the end of the input to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.
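For reference, the kind of prompt I mean looks roughly like this (model name and wording are placeholders; python-chess supplies the FEN):

```python
import chess
from openai import OpenAI

client = OpenAI()

board = chess.Board()
for san in ["e4", "e5", "Nf3"]:
    board.push_san(san)
played = "Nc6"  # the move we want feedback on

prompt = (f"Position (FEN): {board.fen()}\n"
          f"The player now plays: {played}\n"
          "As a chess tutor, briefly explain whether this move is good and why.")

feedback = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
print(feedback)
```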
Yeah, I was thinking why featured article's author did not use Forsyth–Edwards Notation (FEN) and more complicated chess prompts.
BTW, a year ago when I used FEN for chess playing, LLMs would very quickly/often make illegal moves. (The article prompts me to check has that changed...)
If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this)
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
My understanding of this is the following:
All the bad models are chat models, somehow "generation 2 LLMs" which are not just text completion models but instead trained to behave as a chatting agent. The only good model is the only "generation 1 LLM" here, which is gpt-3.5-turbo-instruct. It is a straightforward text completion model. If you prompt it to "get in the mind" of PGN completion then it can use some kind of system 1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model it doesn't work, since these stochastic pathways somehow degenerate during the training to be a chat agent. You can however play chess with system 2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
Ok whoah, assuming the chess powers on gpt3.5-instruct are just a result of training focus then we don't have to wait on bigger models, we just need to fine tune on 175B?
I would be very curious to know what the results would be with a temperature closer to 0. I don't really understand why he did not test the effect of different temperatures on his results.
Here, basically you would like the "best" or "most probable" answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves more often. This temperature is even lower than what is commonly used for chat assistants (around 0.8).
I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run.
The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.
Source: I'm at OpenAI and I was one of the first people to ever play chess against the GPT-4 base model. You may or may not trust OpenAI, but we're just a group of people trying earnestly to build cool stuff. I've never seen any inkling of an attempt to cheat evals or cheat customers.
It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.
> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.
The trick to getting a model to perform on something is to have it as a training data subset.
OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.
This is what people refer to as "lobotomy": AI models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push in em
They probably concluded that the additional cost of training those models on chess would not be "cost effective", and dropped chess from their training process for the moment.
That is to say, we can literally say anything because this is very shadowy/murky, but since everything is likely a question of money... this should _probably_ not be very far from the truth...
I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry": they're reflecting the training set and not using any underlying logic? Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the Reward Models used were able to distinguish high quality moves vs subpar moves...
LLMs can't count the Rs in strawberry because of tokenization. Words are converted to vectors (numbers), so the actual transformer network never sees the letters that make up the word.
ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
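For the curious, you can see this yourself with OpenAI's tiktoken library (the exact IDs depend on which encoding you pick; this is just a sketch):

    # Inspect how an OpenAI-style tokenizer chunks a word.
    # Requires `pip install tiktoken`; token IDs vary by encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # a short list of integers
    print([enc.decode([i]) for i in ids])  # the chunks the model actually "sees"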
Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
For me it’s not only the chess. Chats get more chatty, but knowledge and fact-wise - it’s a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.
If it was trained with moves and hundreds of thousands of entire games at various levels, I do see it generating good moves and beating most players except the high-Elo players.
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
"It lost every single game, even though Stockfish was on the lowest setting."
It's not playing against a GM, the prompt just phrases it this way. I couldn't pinpoint the exact ELO of "lowest" stockfish settings, but it should be roughly between 1000 and 1400, which is far from professional play.
Perhaps my understanding of LLMs is quite shallow, but instead of the current statistical methods, would it be possible to somehow train GPT how to reason by providing instructions on deductive reasoning? Perhaps not semantic reasoning, but syntactic at least?
My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
my friend pointed out that Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1 mini's poor performance is puzzling, though.
Is it just me or does the author swap descriptions of the instruction finetuned and the base gpt-3.5-turbo?
It seemed like the best model was labeled instruct, but the text saying instruct did worse?
if this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so
All of the LLM models tested playing chess performed terribly bad against Stockfish engine except gpt-3.5-turbo-instruct, which is a closed OpenAI model.
If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
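A minimal sketch of that augmentation idea (hypothetical, just to illustrate; tiktoken is used here only as a stand-in tokenizer):

    # Hypothetical augmentation: with probability p, explode a token into the
    # tokens of its individual characters. Not anyone's actual training pipeline.
    import random
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def explode_some_tokens(token_ids, p=0.1):
        out = []
        for t in token_ids:
            if random.random() < p:
                for ch in enc.decode([t]):
                    out.extend(enc.encode(ch))  # one or more tokens per character
            else:
                out.append(t)
        return out

    ids = enc.encode("1. e4 e5 2. Nf3 Nc6")
    print(explode_some_tokens(ids, p=0.5))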
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
I feel like the article neglects one obvious possibility: that OpenAI decided that chess was a benchmark worth "winning", special-cases chess within gpt-3.5-turbo-instruct, and then neglected to add that special-case to follow-up models since it wasn't generating sustained press coverage.
This is exactly it. Here’s the pull request where chess evals were added: https://github.com/openai/evals/pull/45.
I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.
This is exactly what I feel AI needs. A manager AI that then hands off things to specialized more deterministic algorithms/machines.
7 replies →
That's something completely different than what the OP suggests and would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).
14 replies →
TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For example a complicated math equation, using a calculator is just smarter than doing it in your head.
8 replies →
Recognize and hand over to a specialist engine? That might be useful for AI. Maybe I am missing something.
8 replies →
That's not much different from a compiler being rigged to recognize a specific benchmark program and spit out a canned optimization.
1 reply →
This seems quite likely to me, but did they special case it by reinforcement training it into the LLM (which would be extremely interesting in how they did it and what its internal representation looks like) or is it just that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?
That's easy to test, invent a new chess variant and see how the model does.
20 replies →
Of course it's a benchmark worth winning, has been since Watson. And before that even with mechanical turks.
To be fair, they say
> Theory 2: GPT-3.5-instruct was trained on more chess games.
If that were the case, pumping big Llama chock full of chess games would produce good results. It didn't.
The only way it could be true is if that model recognized and replayed the answer to the game from memory.
1 reply →
Why couldn't they add a tool that literally calls stockfish or a chess ai behind the scenes with function calling and buffer the request before sending it back to the endpoint output interface?
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat ai, pass it through a moderation filter, and you might get good output from it with very little latency added.
Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.
That's the problem with closed models, we can never know what they're doing.
Maybe they even delegate it to a chess engine internally via the tool use and the LLM uses that.
Important testing excerpts:
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
Ah, buried in the post-article part. I was wondering how all of the models were seemingly capable of making legal moves, since last I saw something about LLMs playing Chess they were very much not capable of that.
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radically different tokenization.
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.
E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
5 replies →
I’m the one who will fight you including with peer reviewed papers indicating that it is in fact due to tokenization. I’m too tired but will edit this for later, so take this as my bookmark to remind me to respond.
6 replies →
At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". You know how many errors an LLM gets in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc?
My assumption is that these large companies wouldn't pay hundreds of thousands of RLHF'ers through dozens of third party companies livable wages if tokenization errors were just that.
2 replies →
> FWIW I think most of the "tokenization problems"
List of actual tokenization limitations: 1- strawberry 2- rhyming and metrics 3- whitespace (as displayed in the article)
It can count words in a paragraph though. So I do think it's tokenization.
I feel like we can set our qualifying standards higher than counting.
I think it's infeasible to train on bytes unfortunately, but yeah it also seems very wrong to use a handwritten and ultimately human version of tokens (if you take a look at the tokenizers out there you'll find fun things like regular expressions to change what is tokenized based on anecdotal evidence).
I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
It's not even possible to encode all forms of knowledge.
1 reply →
https://youtu.be/zduSFxRajkE
karpathy agrees with you, here he is hating on tokenizers while re-building them for 2h
Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
I tend to agree with you. Your post reminded me of https://gwern.net/aunn
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.
So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.
You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
Perhaps we can even do away with transformers and use a fully connected network. We can always prune the model later ...
A byte is itself sort of a token. So is a bit. It makes more sense to use more tokenizers in parallel than it does to try and invent an entirely new way of seeing the world.
Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.
I would say that "humans have to tokenize" is almost precisely the opposite of how human intelligence works.
We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.
1 reply →
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
Couldn't we just make every human readable character a token?
OpenAI's tokenizer makes "chess" "ch" and "ess". We could just make it into "c" "h" "e" "s" "s"
10 replies →
That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.
Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
This is probably unnecessary, but: I wish you wouldn't use the word "stupid" there. Even if you didn't mean anything by it personally, it might reinforce in an insecure reader the idea that, if one can't speak intelligently about some complex and abstruse subject that other people know about, there's something wrong with them, like they're "stupid" in some essential way. When in fact they would just be "ignorant" (of this particular subject). To be able to formulate those questions at all is clearly indicative of great intelligence.
> This is probably unnecessary
you're certainly right
1 reply →
I think on the contrary, the more you can restrict it to reasonable inputs/outputs, the less powerful LLM you are going to need.
hot take: LLM tokens are kanji for AI, and just like kanji it works okay sometimes but fails miserably for the task of accurately representing English
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).
If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
12 replies →
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.
But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
2 replies →
I have seen a bunch of tokenization papers with various ideas but their results are mostly meh. I personally don't see anything principally wrong with current approaches. Having discrete symbols is how natural language works, and this might be an okayish approximation.
It's probably worth to play around with different prompts and different board positions.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment, for example giving instructions to the non-instruction tuned variants may be counter productive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that better predicts what is most likely to happen here.
[1]: https://i.imgur.com/qRxalgH.png
Apparently I can find some matches for games that start like that between very strong players [1], so my hypothesis that the model may just be predicting bad moves on purpose seems wobbly, although having stockfish at the lowest level play as the supposedly very strong opponent may still be throwing the model off somewhat. In the charts the first few moves the model makes seem decent, if I'm interpreting these charts right, and after a few of those things seem to start going wrong.
Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, stockfish strength, starting position, the name of the supposed players, etc.).
[1]: https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
Interesting thought: the LLM isn't trying to win, it's trying to produce data like the input data. It's quite rare for a very strong player to play a very weak one. If you feed it lots of weak moves, it'll best replicate the training data by following with weak moves.
The experiment started from the first move of a game, and played each game fully. The position you linked was just an example of the format used to feed the game state to the model for each move.
What would "winning" or "losing" even mean if all of this was against a single move?
Agree with this. A few prompt variants:
* What if you allow the model to do Chain of Thought (explicitly disallowed in this experiment)
* What if you explain the board position at each step to the model in the prompt, so it doesn't have to calculate/estimate it internally.
They also tested GPT-o1, which is always CoT. Yet it is still worse.
He was playing full games, not single moves.
Does it ever try an illegal move? OP didn't mention this and I think it's inevitable that it should happen at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
In my experience you are lucky if it manages to give you 10 legal moves in a row, e.g. https://news.ycombinator.com/item?id=41527143#41529024
Yes, he discusses using a grammar to restrict to only legal moves
I suspect the models probably memorized some chess openings, and afterwards they are just playing random moves with the help of the grammar.
1 reply →
> he discusses using a grammar to restrict to only legal moves
Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, and it's white's turn, and there isn't a white piece in a5, and a5 isn't threatened by black. Otherwise it isn't.
"White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?
Still an interesting direction of questioning. Maybe could be rephrased as "how much work is the grammar doing"? Are the results with the grammar very different than without? If/when a grammar is not used (like in the openai case), how many illegal moves does it try on average before finding a legal one?
4 replies →
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
Clearly, there's more going on here.
There are some who suggest that modern chess is mostly a game of memorization and not one particularly of strategy or skill. I assume this is why variants like speed chess exist.
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.
3 replies →
> Then you should be surprised that turbo-instruct actually plays well, right?
Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?
To be clear, this would be an entirely appropriate approach to problem-solving in the real world, it just wouldn't be the LLM that's playing chess.
Yes, probably there is more going on here, e.g. it is cheating.
"playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
1 reply →
But to some approximation we do know how an LLM plays chess.. based on all the games, sites, blogs, analysis in its training data. But it has a limited ability to tell a good move from a bad move since the training data has both, and some of it lacks context on move quality.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses and everything in between. Creators chip away at the edges to change that distribution, but the fundamental workings don't change.
One of the main purposes of running experiments of any sort is to find out if our preconceptions are accurate. Of course, if someone is not interested in that question, they might as well choose not to look through the telescope.
Sadly there’s a common sentiment on HN that testing obvious assumptions is a waste of time
4 replies →
This is a puzzle given enough training information. An LLM can successfully print out the status of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. "Decent" is subjective, but that should beat at least beginners. And the lowest level of Stockfish used in the blog post is lowest intermediate.
I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
30 replies →
Stockfish level 1 is well below "lowest intermediate".
A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.
It will hang pieces, and you can hang your own queen and there's about a 50% chance it won't be taken.
Because it's a straightforward stochastic sequence modelling task, and I've seen GPT-3.5-turbo-instruct play at a high amateur level myself. But it seems like all the RLHF and distillation that is done on newer models destroys that ability.
Question here is why gpt-3.5-instruct can then beat stockfish.
PS: I ran it, and as suspected gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close: "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
5 replies →
Cheating (using a internal chess engine) would be the obvious reason to me.
6 replies →
The article appears to have only run Stockfish at low levels. You don't have to be very good to beat it.
I'm actually surprised any of them manage to make legal moves throughout the game once out of book moves.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at this current level; like previous instances of chatbots and the like, while initially they can convince some people that they're intelligent thinking machines, this test proves that they aren't. It's part of the scientific process.
turbo instruct does play chess reasonably well.
https://github.com/adamkarvonen/chess_gpt_eval
Even the blog above says as much.
They thought it because we have an existence proof: gpt-3.5-turbo-instruct can play chess at a decent level.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
I suppose you didn't get the news, but Google developed an LLM that can play chess. And play it at grandmaster level: https://arxiv.org/html/2402.04494v1
That article isn't as impressive as it sounds: https://gist.github.com/yoavg/8b98bbd70eb187cf1852b3485b8cda...
In particular, it is not an LLM and it is not trained solely on observations of chess moves.
Not quite an LLM. It's a transformer model, but there's no tokenizer or words, just chess board positions (64 tokens, one per board square). It's purpose-built for chess (never sees a word of text).
1 reply →
It's interesting to note that the paper benchmarked its chess playing performance against GPT-3.5-turbo-instruct, the only well performant LLM in the posted article.
Right, at least as of the ~GPT-3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made a bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that it would need to from context try and make it so white doesn't make critical mistakes.
I love how LLMs are the one subject matter where even most educated people are extremely confidently wrong.
Ppl acting like LLMs!
Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play just decent amateur level.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
It'd be more interesting to see LLMs play Family Feud. I think it'd be their ideal game.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
You shouldn't, but there are lots of things that LLMs can do that educated people wouldn't expect them to be able to do.
> I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
The blog post demonstrates that a LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stress your prior unfounded belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation of people not being educated when clearly you are the one that's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
The only service performing well is a closed source one that could simply use a real chess engine for questions that look like chess, for marketing purposes. There’s nothing thought provoking about a bunch of engineers doing “experiments” against a service, other than how sad it is to debase themselves in this way.
1 reply →
There are many ways to test for reasoning and deterministic computation, as my own work in this space has shown.
But there's really nothing about chess that makes reasoning a prerequisite, a win is a win as long as it's a win. This is kind of a semantics game: it's a question of whether the degree of skill people observe in an LLM playing chess is actually some different quantity than the chance it wins.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should always expect for the computation that you're able to do via conscious reasoning alone to always be sufficient, at least in principle, to asymptotically get a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.
> It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
Yeah, that is the "something weird" of the article.
Bro, it actually did play chess, didn't you read the article?
It sorta played chess: he let it generate up to ten moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try, he picked a random legal move. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
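If I'm reading the methodology right, the loop is roughly this (a sketch, not the author's actual code; `complete` stands in for whatever call hits the model):

    # Rough reconstruction of the sampling loop described above: try up to 10
    # completions, keep the first legal move, otherwise play a random legal move.
    import random
    import chess

    def next_move(board, pgn_so_far, complete):
        for _ in range(10):
            text = complete(pgn_so_far).strip()
            if not text:
                continue
            candidate = text.split()[0]
            try:
                return board.parse_san(candidate)  # raises on illegal or garbled moves
            except ValueError:
                continue
        return random.choice(list(board.legal_moves))  # the random fallback he describes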
3 replies →
> I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization, whatever that is.
It's just a lossy compression of all of the parameters, probably not important, right?
Probably important when competing against unquantized ones from OpenAI
Notably: there were other OpenAI models that weren't quantized but also performed poorly.
i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.
while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.
> i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.
Sure, but so does the number of paragraphs in the english language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.
(And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)
Not true if we’re talking sensible chess moves.
What about the number of possible positions where an idiotic move hasn't been played? Perhaps the search space could be reduced quite a bit.
1 reply →
Since we're mentioning Shannon... What is the minimum representative sample size of that problem space? Is it close enough to the number of freely available chess moves on the Internet and in books?
> I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.
Yeah, once you've deviated from a sequence you're lost.
Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
you should try it and post a rebuttal :)
I found a related set of experiments that include gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4.
Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.
https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
I’d bet it’s using function calling out to a real chess engine. It could probably be proven with a timing analysis to see how inference time changes/doesn’t with number of tokens or game complexity.
?? why would openai even want to secretly embed chess function calling into an incredibly old model? if they wanted to trick people into thinking their models are super good at chess why wouldn't they just do that to gpt-4o?
1 reply →
If it were calling to a real chess engine there would be no illegal moves.
1 reply →
OpenAI has a TON of experience making game-playing AI. That was their focus for years, if you recall. So it seems like they made one model good at chess to see if it had an overall impact on intelligence (just as learning chess might make people smarter, or learning math might make people smarter, or learning programming might make people smarter)
Playing is strongly related to an abstract representation of the game in game states. Even if the player does not realize it, with chess it's really about shallow or beam search within the possible moves.
LLMs don't do reasoning or exploration; they write text based on previous text. So to us it may seem like playing, but it's really smart guesswork based on previous games. It's like Kasparov writing moves without imagining the actual placement.
What would be interesting is to see whether a model, given only the rules, will play. I bet it won’t.
At this moment it's replaying from memory but definitely not chasing goals. There's no such thing as forward attention yet, and beam search is expensive enough that one would prefer to fall back to classic chess algos.
I think you're confusing OpenAI and DeepMind.
OpenAI has never done anything except conversational agents.
Very wrong. The first time most people here probably heard about OpenAI back in 2017 or so was their DotA 2 bot.
https://en.wikipedia.org/wiki/OpenAI_Five
https://openai.com/index/gym-retro/
> OpenAI has never done anything except conversational agents.
Tell me you haven't been following this field without telling me you haven't been following this field[0][1][2]?
[0]: https://github.com/openai/gym
[1]: https://openai.com/index/jukebox/
[2]: https://openai.com/index/openai-five-defeats-dota-2-world-ch...
They definitely have game-playing AI expertise, though: https://noambrown.github.io/
No, they started without conversation and only reinforcement learning on games, directly comparable to DeepMind.
“In the summer of 2018, simply training OpenAI's Dota 2 bots required renting 128,000 CPUs and 256 GPUs from Google for multiple weeks.”
At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training. That's not something specific to LLMs or OpenAI. Compiler companies have done the same thing for decades, specifically detecting common benchmark programs and inserting hand-crafted optimizations. Similarly, the shader compilers in GPU drivers have special cases for common games and benchmarks.
VW got in a lot of trouble for this
Apples and oranges. VW actually cheated on regulatory testing to bypass legal requirements. So to be comparable, the government would first need to pass laws where e.g. only compilers that pass a certain benchmark are allowed to be used for purchasable products and then the developers would need to manipulate behaviour during those benchmarks.
36 replies →
True. But they did not optimize for a specific case. They detected the test and then enabled a special regime, that was not used normally.
It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.
1 reply →
Most of the time these days compiler writers are not cheating like VW did. In the 1980s compiler writers would insert code to recognize performance tests and then cheat - output values hard coded into the compiler instead of running the algorithm. Which is the type of thing that VW got in trouble for.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
Actually performing well on a task that is used as a benchmark is not comparable to deceiving authorities about how much toxic gas you are releasing.
Only because what VW did is illegal, was super large scale, and could be linked to a lot of indirect deaths through the additional pollution.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
GPT-3.5 did not “cheat” on chess benchmarks, though, it was actually just better at chess?
2 replies →
Not quite. VW got in trouble for running _different_ software in test vs prod. These optimizations are all going to "prod" but are only useful for specific targets (a specific game in this case).
3 replies →
It’s approximately bad, like most of ML
On one side:
Would you expect a model trained on no Spanish data to do well on Spanish?
On the other:
Is it okay to train on the MMLU test set?
This is 10 year old story. It’s very interesting which ones stay in the public consciousness.
Funny response; you're not wrong.
We detached this subthread from https://news.ycombinator.com/item?id=42144784.
(Nothing wrong with it! It's just a bit more generic than the original topic.)
Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.
One could try using DSPy for automatic prompt optimization.
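Something along these lines, maybe (the wording is entirely made up, just to show the step-by-step structure):

    # A hypothetical CoT-style prompt template; not from the article.
    COT_PROMPT = (
        "You are playing white. Game so far (PGN): {pgn}\n"
        "1. Briefly describe the current position.\n"
        "2. List three candidate moves and look one or two plies ahead for each.\n"
        "3. On the last line, output only your chosen move in algebraic notation.\n"
    )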
> 1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3.
Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?
Can be forced through inference with CoT type of stuff. Spend tokens at each stage to draw the board for example, then spend tokens restating the rules of the game, then spend tokens restating the heuristics like piece value, and then spend tokens doing a minmax n-ply search.
Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.
Essentially the user would have to teach GPT to play chess, or training would have to fine-tune it toward this kind of CoT, etc...
Yeah, expecting an immediate answer definitely hurts the results, especially in the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
Maybe that one which plays chess well is calling out to a real chess engine.
It's not:
1. That would just be plain bizarre
2. It plays like what you'd expect from a LLM that could play chess. That is, level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of stockfish etc does. Also the specific chess notation being prompted actually matters
3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame
4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval
5. You can (or, well, you used to be able to) inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting it for Turbo instruct.
> Also the specific chess notation being prompted actually matters
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
1 reply →
The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPt-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
There's nothing shady going on, I promise.
Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
Not that convoluted really
2 replies →
This is likely. From example games, it not only knows the rules (which would be impressive by itself, just making the legal moves is not trivial). It also has some planning capabilities (plays combinations of several moves).
Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.
OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
[1] https://github.com/thomasahle/sunfish
[2] https://lichess.org/@/sunfish-engine
this possibility is discussed in the article and deemed unlikely
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count until 10000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
[1] https://dynomight.substack.com/p/chess/comment/77190852
3 replies →
I don't see that discussed, could you quote it?
Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
Just think about the trade-off from OpenAI's side here: they're going to add a bunch of complexity to gpt-3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time, and even when it doesn't, at a mere 1800 Elo level? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
But there could be a simple explanation. For example, they could have tested many "engines" when developing function calling and they just left them in there. They just happened to connect to a basic chess playing algorithm and nothing sophisticated.
Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.
1 reply →
Could be a pilot implementation to learn about how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known, standardized and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule based engine in house), the need to know everything they can about expected cost, performance.
14 replies →
Sorry, this is just conspiracy theorizing. I've tried it with GPT-3.5-instruct myself in the OpenAI playground, where the model clearly does nothing but auto-regression. No function calling there whatsoever.
Occam's razor. I could build a good chess-playing wrapper around the OpenAI API (any version) that would consult a chess engine when presented with any board scenario, and introduce some randomness so that it doesn't play too well.
I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can substitute chess for any formal system that has a reliable oracle.
Yes! I also was waiting for this seemingly obvious answer in the article as well. Hopefully the author will see these comments.
I have this hypothesis as well: that OpenAI added a lot of "classic" algorithms and rules over time (e.g. rules for filtering, etc.)
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
> OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
When ChatGPT3.5 first came out, people were using it to simulate entire Linux system installs, and even browsing a simulated Internet.
Cool use cases like that aren't even discussed anymore.
I still wonder what sort of magic OpenAI had and then locked up away from the world in the name of cost savings.
Same thing with GPT 4 vs 4o, 4o is obviously worse in some ways, but after the initial release (when a bunch of people mentioned this), the issue has just been collectively ignored.
You can still do this. People just lost interest in this stuff because it became clear to what degree the simulation is really being done (shallowly).
Yet I do wish we had access to less finetuned/distilled/RLHF'd models.
People are doing this all the time with Claude 3.5.
Stallman may have his flaws, but this is why serious research happens with source code (or at least with binaries).
Why do you doubt it? I thought it was well known that ChatGPT has degraded over time for the same model, mostly for cost-saving reasons.
ChatGPT is - understandably - blatantly different in the browser compared to the app, or it was until I deleted it anyway
We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local optima (model weights) during training. It might even be (probably is) a non-verbal connectome that's purely logic rules, having nothing to do with language at all, but a semantic-space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
related : Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task https://arxiv.org/abs/2210.13382
Chess-GPT's Internal World Model https://adamkarvonen.github.io/machine_learning/2024/01/03/c... discussed here https://news.ycombinator.com/item?id=38893456
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
wow I actually did something similar recently and no LLM could win and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess
I am very surprised by the perf of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.
PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.
"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting
I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
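For reference, this is roughly what throttling Stockfish looks like with python-chess; the author's exact settings aren't spelled out, so the numbers here are placeholders:

  import chess
  import chess.engine

  board = chess.Board()
  with chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish") as engine:
      # Two common ways to weaken the engine: lower its skill level and/or
      # cap how many nodes it may search per move.
      engine.configure({"Skill Level": 0})
      result = engine.play(board, chess.engine.Limit(nodes=100))
      print(board.san(result.move))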
Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.
Unlike people, how you ask the question really really affects the output quality.
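To make that concrete, here are two ways of asking for the same move (illustrative only; the article's exact prompt wording may differ):

  # Completion-style PGN prompt: the model just continues the transcript.
  pgn_prompt = (
      '[White "Garry Kasparov"]\n'
      '[Black "Magnus Carlsen"]\n\n'
      "1. e4 e5 2. Nf3 Nc6 3."
  )

  # Conversational prompt: same position, asked in English.
  chat_prompt = (
      "We are playing chess. The game so far: 1. e4 e5 2. Nf3 Nc6. "
      "What should White play next?"
  )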
I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!). In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next-token prediction.
I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.
I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen, you lose -- so I won!
The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference), or sample the model for a valid move up to ten times and, failing that, pick a random legal move.
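A minimal sketch of that retry loop, assuming python-chess and a hypothetical complete() function that returns the model's next-move text (the grammar-constrained path for open models isn't shown):

  import random
  import chess

  def next_move(board: chess.Board, prompt: str, complete, retries: int = 10) -> chess.Move:
      for _ in range(retries):
          words = complete(prompt).strip().split()
          if not words:
              continue
          try:
              return board.parse_san(words[0])  # raises if illegal or unparseable
          except ValueError:
              continue  # sample again
      # Give up and fall back to a uniformly random legal move.
      return random.choice(list(board.legal_moves))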
I think it only needs to have read sufficient PGNs.
My money is on a fluke inclusion of more chess data in that model's training.
All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar, so training data is the most likely explanation.
I feel like a lot of people here are slightly misunderstanding how LLM training works. Yes, the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety but for any desired feature.
These companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves.
From this OpenAI paper (page 29): https://arxiv.org/pdf/2312.09390#page=29
"A.2 CHESS PUZZLES
Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the model's ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
Yeah. This.
Keep in mind, everyone, that Stockfish on its lowest level on lichess is absolutely terrible, and a 5-year-old human who'd been playing chess for a few months could beat it regularly. It hangs pieces, makes -3 blunders, plays totally random-looking bad moves.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
Here's a quick anonymous game against it by me, where I obliterate the poor thing in 11 moves. I was around a 1500 Elo classical-strength player, which is a teeny bit above average globally. But I mean - not an expert, or even one of the "strong" club players (in any good club).
https://lichess.org/BRceyegK -- the game, you'll see it make the ultimate classic opening errors
https://lichess.org/ -- try yourself! It's really so bad, it's good fun. Click "play with computer" on the right, then level 1 is already selected, you hit go
[dupe] https://news.ycombinator.com/item?id=42138276
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that tokenization causes something minuscule like a " " at the end of the input to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
The author mentions in the comment section that changing temperature did not help.
I’ve also been experimenting with Chess and LLMs but have taken a slightly different approach. Rather than using the LLM as an opponent, I’ve implemented it as a chess tutor to provide feedback on both the user’s and the bot’s moves throughout the game.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.
AI Chess GPT https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978 https://play.google.com/store/apps/details?id=net.padma.app....
Thanks
Yeah, I was wondering why the featured article's author did not use Forsyth–Edwards Notation (FEN) and more complicated chess prompts.
BTW, a year ago when I used FEN for chess playing, LLMs would very quickly/often make illegal moves. (The article prompts me to check whether that has changed...)
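A sketch of that kind of FEN-based prompt, plus the legality check that catches the illegal-move problem (python-chess; the prompt wording here is made up, not the app's or article's actual prompt):

  import chess

  board = chess.Board()
  board.push_san("e4")
  board.push_san("e5")

  prompt = (
      f"Position (FEN): {board.fen()}\n"
      "You are playing White. Reply with your next move in standard algebraic notation."
  )

  reply = "Nf3"  # imagine this came back from the model
  try:
      board.push_san(reply)
      print("legal:", reply)
  except ValueError:
      print("illegal or unparseable:", reply)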
If you look at the comments under the post, the author commented 25 minutes ago (as of me posting this)
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
well now we are curious!
My understanding of this is the following: all the bad models are chat models, somehow "generation 2 LLMs" which are not just text-completion models but are instead trained to behave as a chatting agent. The only good model is the only "generation 1 LLM" here, which is gpt-3.5-turbo-instruct. It is a straightforward text-completion model. If you prompt it to "get in the mind" of PGN completion, then it can use some kind of system-1 thinking to give a decent approximation of the PGN Markov process. If you attempt to use a chat model, it doesn't work, since these stochastic pathways somehow degenerate during the training to be a chat agent. You can however play chess with system-2 thinking, and the more advanced chat models are trying to do that and should get better at it while still being bad.
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
OK, whoa - assuming the chess powers of gpt-3.5-instruct are just a result of training focus, then we don't have to wait on bigger models; we just need to fine-tune a 175B one?
I would be very curious to know what the results would be with a temperature closer to 1. I don't really understand why he did not test the effect of different temperatures on his results.
Here, basically you would like the "best" or "most probable" answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves. This temperature is nevertheless even lower than what is commonly used for a chat assistant (around 0.8).
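For intuition, here's what temperature does to a toy distribution over three candidate moves (made-up logits; lower temperature concentrates probability on the top choice, higher temperature flattens it):

  import numpy as np

  logits = np.array([4.0, 2.0, 1.0])  # hypothetical scores for three candidate moves

  def softmax_with_temperature(x, temperature):
      z = (x - x.max()) / temperature
      p = np.exp(z)
      return p / p.sum()

  print(softmax_with_temperature(logits, 0.2))  # near-greedy: top move dominates
  print(softmax_with_temperature(logits, 0.7))  # the article's setting
  print(softmax_with_temperature(logits, 1.0))  # raw distribution: more randomness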
I would be interested to know if the good result is repeatable. We had a similar result with a quirky chat interface in that one run gave great results (and we kept the video) but then we couldn't do it again. The cynical among us think there was a mechanical turk involved in our good run. The economics of venture capital means that there is enormous pressure to justify techniques that we think of as "cheating". And of course the companies involved have the resources.
It's repeatable. OpenAI isn't cheating.
Source: I'm at OpenAI and I was one of the first people to ever play chess against the GPT-4 base model. You may or may not trust OpenAI, but we're just a group of people trying earnestly to build cool stuff. I've never seen any inkling of an attempt to cheat evals or cheat customers.
It would be really cool if someone could get an LLM to actually launch an anonymous game on Chess.com or Lichess and actually have any sense as to what it’s doing.[1] Some people say that you have to represent the board in a certain way. When I first tried to play chess with an LLM, I would just list out a move and it didn’t do very well at all.
[1]: https://youtu.be/Gs3TULwlLCA
> And then I tried gpt-3.5-turbo-instruct. This is a closed OpenAI model, so details are very murky.
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.
The trick to getting a model to perform on something is to have it as a training data subset.
OpenAI might have thought chess was good to optimize for, but it wasn't seen as useful, so they dropped it.
This is what people refer to as "lobotomy": AI models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push into them.
"...And how to construct that state from lists of moves in chess’s extremely confusing notation?"
Algebraic notation is completely straightforward.
They probably concluded that the additional cost of training those models on chess would not be cost-effective, and dropped chess from their training process, for the moment.
That is to say, we can say literally anything because this is very shadowy/murky, but since everything is likely a question of money... that should probably not be very far from the truth.
I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry": they're reflecting the training set and not using any underlying logic? Granted, my understanding of LLMs is not very sophisticated, but I would be surprised if the reward models used were able to distinguish high-quality moves from subpar moves...
LLMs can't count the Rs in strawberry because of tokenization. Words are converted to vectors (numbers), so the actual transformer network never sees the letters that make up the word.
ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
Hm but if that is the case, then why did LLMs only fail at the tasks for a few word/letter combinations (like r's in "Strawberry"), and not all words?
It makes me wonder about other games. If LLMs are bad at games, would they be bad at solving problems in general?
Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting.
Okay, so "Excellent" still means probably quite bad. I assume at the top difficult setting gpt-3.5-turbo-instruct will still lose badly.
Probably even at lvl 2 out of 9 it would lose all the games.
It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)
Or maybe it's able to recognise the chess game, then get moves from an external chess game API?
For me it’s not only the chess. Chats get more chatty, but knowledge and fact-wise - it’s a sad comedy. Yes, you get a buddy to talk with, but he is talking pure nonsense.
If it was trained with moves and hundreds of thousands of entire games of various levels, I could see it generating good moves and beating most players except the high-Elo players.
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
The GPT-4 pretraining set included chess games in PGN notation from 1800+ ELO players. I can't comment on any other models.
Let's be real, though: most people can't beat a grandmaster. It is impressive to see it last more moves as it progressed.
"It lost every single game, even though Stockfish was on the lowest setting."
It's not playing against a GM, the prompt just phrases it this way. I couldn't pinpoint the exact ELO of "lowest" stockfish settings, but it should be roughly between 1000 and 1400, which is far from professional play.
I feel like an easy win here would be retraining an LLM with a tokenizer specifically designed for chess notation?
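Something like this, maybe: a toy vocabulary where squares, pieces, and annotations are whole tokens, so a SAN move always splits the same way (purely illustrative, not a real tokenizer):

  FILES = "abcdefgh"
  RANKS = "12345678"
  SQUARES = [f + r for f in FILES for r in RANKS]            # "a1" ... "h8"
  PIECES = ["K", "Q", "R", "B", "N"]
  SYMBOLS = ["x", "+", "#", "=", "O-O-O", "O-O", " ", "."]
  VOCAB = {tok: i for i, tok in enumerate(SQUARES + PIECES + SYMBOLS)}

  def tokenize_move(san: str) -> list[int]:
      # Greedy longest-match over the tiny vocabulary.
      out, i = [], 0
      while i < len(san):
          for length in (5, 3, 2, 1):
              piece = san[i:i + length]
              if piece in VOCAB:
                  out.append(VOCAB[piece])
                  i += length
                  break
          else:
              i += 1  # skip characters the toy vocabulary doesn't cover
      return out

  print(tokenize_move("Nf3+"))  # piece, square, check marker: three stable tokens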
What would happen if you'd prompted it with much more text, e.g. general advice by a chess grandmaster?
perhaps my understanding of LLM is quite shallow, but instead of the current method of using statistical methods, would it be possible to somehow train GPT how to reason by providing instructions on deductive reasoning? perhaps not semantic reasoning but syntactic at least?
Perhaps if it doesn't have enough data to explain but it has enough to go "on gut"
I had the same experience with LLM text-to-sql, 3.5 instruct felt a lot more robust than 4o
How well does an LLM/transformer architecture trained purely on chess games do?
Training works as expected:
https://news.ycombinator.com/item?id=38893456
I wonder if the llm could even draw the chess board in ASCII if you asked it to.
My guess is they just trained gpt3.5-turbo-instruct on a lot of chess, much more than is in e.g. CommonCrawl, in order to boost it on that task. Then they didn't do this for other models.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
They have a massive economic incentive to make their closed-source software look as good as possible, so why wouldn't they cheat?
My friend pointed out that the Q5_K_M quantization used for the open-source models probably substantially reduces the quality of play. o1-mini's poor performance is puzzling, though.
Has anyone tested a vision model? Seems like they might be better
I've tried with GPT, it's unable to accurately interpret the board state.
I would love to see the prompts (the data) this person used.
Would be more interesting with trivial Lora training
In a sense, a chess game is also a dialogue
All dialogues are pretty easily turned into text completions
What about contemporary frontier models?
> I only ran 10 trials since AI companies have inexplicably neglected to send me free API keys
Sure, but nobody is required to send you anything for free.
Here is a truly brilliant game. It's Google Bard vs. Chat GPT. Hilarity ensues.
https://www.youtube.com/watch?v=FojyYKU58cw
Theory 5: gpt-3.5-turbo-instruct has a chess engine attached to it.
Is it just me, or does the author swap the descriptions of the instruction-finetuned and the base gpt-3.5-turbo? It seemed like the best model was the one labeled instruct, but the text says the instruct one did worse?
If this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer: that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so.
TL;DR.
All of the LLMs tested played chess terribly against the Stockfish engine, except gpt-3.5-turbo-instruct, which is a closed OpenAI model.
If tokenization is such a big problem, then why aren't we training new base models on partially non-tokenized data? E.g., during training, randomly substitute some percentage of the input tokens with individual letters.
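A sketch of that augmentation, done at the token-string level for readability (a toy example; a real pipeline would work on IDs and re-encode the characters as byte/char tokens):

  import random

  def char_dropout(tokens, p=0.1, seed=None):
      # With probability p, replace a token with its individual characters so the
      # model also sees spelling-level structure during training.
      rng = random.Random(seed)
      out = []
      for tok in tokens:
          if rng.random() < p:
              out.extend(tok)   # split the token into characters
          else:
              out.append(tok)
      return out

  print(char_dropout(["straw", "berry", " has", " three", " r", "'s"], p=0.5, seed=0))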
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
> That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.
This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.
They are almost certainly tokenized in most LLM multi-modal models. https://en.wikipedia.org/wiki/Large_language_model#Multimoda...