It's not:
1. That would just be plain bizarre
2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt and doesn't manifest the way shifting the level of Stockfish etc. does. Also the specific chess notation being prompted actually matters.
3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame
4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval
5. You can, or at least you used to be able to, inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting them for turbo-instruct.
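For concreteness, here is a minimal sketch of what inspecting those logprobs looks like, assuming the current OpenAI Python SDK and that the legacy completions endpoint still returns them for gpt-3.5-turbo-instruct; the PGN prefix is just an illustrative prompt.

```python
# Minimal sketch, assuming the OpenAI Python SDK (v1.x) and that the legacy
# completions endpoint still returns logprobs for gpt-3.5-turbo-instruct.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=pgn_prefix,
    max_tokens=6,
    temperature=0,
    logprobs=5,  # also return the top-5 alternatives at each position
)

lp = resp.choices[0].logprobs
for token, logprob, alternatives in zip(lp.tokens, lp.token_logprobs, lp.top_logprobs):
    print(f"{token!r}: {logprob:.3f}  top alternatives: {alternatives}")
```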
> Also the specific chess notation being prompted actually matters
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling out to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at the current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular, this engine might be neural and trained on actual games rather than board positions, which could explain the dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the one in "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
> Couldn’t this be evidence that it is using an engine?
A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.
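A rough sketch of that test, assuming python-chess and a local Stockfish binary on PATH; get_llm_move() is a hypothetical stand-in for however you query the model, with a random legal move substituted so the harness runs on its own.

```python
# Sketch of the proposed test: play the model against Stockfish at increasing
# skill levels and see where it stops keeping up.
import random
import chess
import chess.engine

def get_llm_move(board: chess.Board) -> chess.Move:
    # Placeholder: substitute a call to the model (e.g. prompted with the PGN
    # so far). A random legal move keeps the sketch self-contained.
    return random.choice(list(board.legal_moves))

def play_one_game(engine: chess.engine.SimpleEngine, skill_level: int) -> str:
    engine.configure({"Skill Level": skill_level})  # Stockfish accepts 0-20
    board = chess.Board()
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            board.push(get_llm_move(board))  # the model plays White
        else:
            board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
    return board.result()

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    for skill in range(0, 21, 4):  # 0, 4, 8, 12, 16, 20
        print(f"Skill {skill}: {play_one_game(engine, skill)}")
```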
Much more likely is that this model was trained on more chess PGNs. You can call that a “neural engine” if you’d like, but it is the simplest explanation and it accounts for the mistakes the model is making.
Game state isn’t just what you can see on the board. It includes the 50-move rule and castling rights. Those were encoded as layers in AlphaZero, along with prior positions of pieces. (Eight prior positions, if I’m remembering correctly.)
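To make that concrete, here is a rough sketch of an AlphaZero-style input encoding; the plane counts (8 history steps of 12 piece planes plus 2 repetition planes, plus 7 metadata planes, 119 in total) follow my reading of the paper and may not match the published layout exactly, and the repetition planes are left as zeros here.

```python
# Rough sketch of an AlphaZero-style input encoding, to illustrate that "game
# state" includes history, castling rights and the 50-move counter, not just
# the visible board. Plane counts are approximate, per my recollection.
import numpy as np
import chess

HISTORY = 8        # number of positions stacked into the input
PIECE_PLANES = 12  # 6 piece types x 2 colours
REP_PLANES = 2     # repetition counters per history step (unused in this sketch)
META_PLANES = 7    # side to move, move count, no-progress count, 4 castling rights

def encode(board: chess.Board, previous: list) -> np.ndarray:
    n_planes = HISTORY * (PIECE_PLANES + REP_PLANES) + META_PLANES
    planes = np.zeros((n_planes, 8, 8), dtype=np.float32)

    # Piece planes for the current position and up to 7 previous ones.
    for t, b in enumerate(([board] + previous)[:HISTORY]):
        base = t * (PIECE_PLANES + REP_PLANES)
        for square, piece in b.piece_map().items():
            offset = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
            planes[base + offset, square // 8, square % 8] = 1.0

    # Metadata planes are constant across the 8x8 grid.
    meta = HISTORY * (PIECE_PLANES + REP_PLANES)
    planes[meta + 0] = float(board.turn == chess.WHITE)
    planes[meta + 1] = board.fullmove_number / 100.0
    planes[meta + 2] = board.halfmove_clock / 100.0  # feeds the 50-move rule
    planes[meta + 3] = float(board.has_kingside_castling_rights(chess.WHITE))
    planes[meta + 4] = float(board.has_queenside_castling_rights(chess.WHITE))
    planes[meta + 5] = float(board.has_kingside_castling_rights(chess.BLACK))
    planes[meta + 6] = float(board.has_queenside_castling_rights(chess.BLACK))
    return planes

print(encode(chess.Board(), []).shape)  # (119, 8, 8)
```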
The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
There's nothing shady going on, I promise.
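For illustration, a minimal sketch of the two framings, assuming the current OpenAI Python SDK; the PGN text, headers, and the chat model named below are only examples, not anything OpenAI is known to do internally.

```python
# Completion framing vs chat framing for the same position.
from openai import OpenAI

client = OpenAI()

moves_so_far = "1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 a6 6."

# Completion framing (gpt-3.5-turbo-instruct): just continue a raw PGN, the
# same shape chess takes in the pretraining data.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=f'[Event "Example"]\n[Result "*"]\n\n{moves_so_far}',
    max_tokens=5,
    temperature=0,
)
print(completion.choices[0].text)

# Chat framing (what the chat models get): the same position wrapped in a
# conversational request.
chat = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model, named only as an example
    messages=[{
        "role": "user",
        "content": f"You are playing White. The game so far is {moves_so_far} "
                   "What is your next move? Reply with a single move in SAN.",
    }],
)
print(chat.choices[0].message.content)
```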
Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
Not that convoluted really
It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*
If you share every prior, and aren't particularly concerned with being disciplined about treating conversation as proposing a logical argument (I'm not myself; people find it off-putting), it probably wouldn't seem at all convoluted.
* Layer chess into gpt-3.5-instruct only, but not ChatGPT, not GPT-4, to defeat the naysayers when GPT-4 comes out? shrugs. If the issues with that are unclear, I can lay it out more.
** FWIW, at the time, pre-ChatGPT, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. It would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *.
This is likely. From the example games, it not only knows the rules (which would be impressive by itself; just making legal moves is not trivial), it also shows some planning capability (it plays combinations of several moves).
Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
I think that's the most plausible theory that would explain the sudden jump from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and then the sudden regression in gpt-4*.
OpenAI also seems to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1], which is around 1900-2000 Elo [2]?
[1] https://github.com/thomasahle/sunfish
[2] https://lichess.org/@/sunfish-engine
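For what it's worth, a purely hypothetical sketch of what that wiring could look like, using python-chess and any UCI engine binary (Sunfish could be dropped in the same way, though its own module API isn't shown); looks_like_pgn() and the llm_complete callable are invented for illustration.

```python
# Purely hypothetical sketch of the "hidden engine" idea: detect chess-looking
# prompts and answer them from an engine instead of the model.
import io
import re
import chess
import chess.engine
import chess.pgn

MOVE_NUMBER = re.compile(r"\b\d+\.\s")

def looks_like_pgn(prompt: str) -> bool:
    # Crude heuristic: two or more "<n>. " move numbers in the prompt.
    return len(MOVE_NUMBER.findall(prompt)) >= 2

def answer(prompt: str, llm_complete, engine_path: str = "stockfish") -> str:
    if looks_like_pgn(prompt):
        game = chess.pgn.read_game(io.StringIO(prompt))
        board = game.board()
        for move in game.mainline_moves():
            board.push(move)
        with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
            best = engine.play(board, chess.engine.Limit(time=0.05)).move
        return board.san(best)
    return llm_complete(prompt)  # everything else goes to the model as usual
```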
This possibility is discussed in the article and deemed unlikely.
Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
The fact that the one closed-source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something that most LLMs can't do, for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
[1] https://dynomight.substack.com/p/chess/comment/77190852
What do you mean LLMs can't count to 10,000 for known reasons?
Separately, if you are able to show OpenAI is serving pre-canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.
I'm not saying this in an aggro tone; it's a genuinely interesting subject to me because I wrote off LLMs at first, thinking this was exactly what was going on.* Then I spent the last couple of years laughing at myself for thinking that they would do that. It would be some mix of fascinating and horrifying to see it come full circle.
* I can't remember what it was exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way before ChatGPT.
I don't see that discussed, could you quote it?