Comment by bubblyworld
1 year ago
If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.
I agree with GP that if a 'fine tuning' of GPT 3.5 came out the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.
That's not an additional criterion; it's simply the most likely version of this hypothetical: a superhuman engine is much easier to integrate than an 1800 Elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play above 1800 Elo out of the box and never make invalid moves (they are far past that level on a log scale, actually).
This doesn't require the "highest" settings; it requires any settings whatsoever.
But anyway, to spell out some of the huge list of unjustified conditions here:
1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via an external call.
2. They used a terrible chess engine for some reason.
3. They did this deliberately because they didn't want to get "caught" for some reason.
4. They removed this functionality in all other versions of GPT, for some reason ... etc.
Much simpler theory:
1. They used more chess data when training that model.
(there are other, much simpler competing theories too)
My point is that given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly-good and not implausibly-good approaches one.
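To make the odds-form version of that concrete, here's a toy calculation; every number below is invented for illustration, not a claim about the actual probabilities:

    # Toy odds-form Bayes update; all numbers are made up.
    # The point: if plausibly-good play is roughly what you'd expect under
    # both hypotheses, the likelihood ratio is ~1 and the observation
    # barely moves the needle either way.
    prior_odds = 0.01 / 0.99    # hypothetical prior odds of "they wired in an engine"

    p_obs_given_engine = 0.9    # P(plausibly-good play | engine wired in and tuned to look plausible)
    p_obs_given_training = 0.9  # P(plausibly-good play | they just trained on more chess data)

    posterior_odds = prior_odds * (p_obs_given_engine / p_obs_given_training)
    print(posterior_odds)       # ~= prior_odds, so the 1800-ish play resolves nothing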
For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation like using mixture-of-experts and having one of the experts be an LLM chess savant.
Most of the sources I can find put the Elo of Stockfish at its lowest setting around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.
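For what it's worth, dialling an engine down to that range is a couple of lines once you have it hooked up at all. A rough sketch using python-chess with Stockfish's standard strength-limiting UCI options (the binary path is assumed, and the exact minimum Elo varies by build):

    import chess
    import chess.engine

    # Sketch only: path is assumed; UCI_LimitStrength / UCI_Elo are standard
    # Stockfish options, but the supported Elo floor depends on the version.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1350})

    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(time=0.1))
    print(result.move)
    engine.quit()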
The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.
So the only interesting considerations are the ones that factor into the likelihood of them deciding to cheat. If you even want to call it that, shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) that this explanation of the data implies does feel like cheating to me.
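To be concrete about the "five lines of Python" bit, something like this hypothetical wrapper (names and rates made up, board being a python-chess Board as in the sketch above) is all the fault injection that version of the story needs:

    import random

    def degrade(engine_move, board, blunder_rate=0.10, illegal_rate=0.01):
        # Hypothetical fault injection: usually pass the engine's move through,
        # occasionally swap in a random legal move, rarely emit an invalid one.
        r = random.random()
        if r < illegal_rate:
            return "e2e9"  # deliberately not a real square
        if r < illegal_rate + blunder_rate:
            return random.choice([m.uci() for m in board.legal_moves])
        return engine_move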
What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI's company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or, if you prefer, just building stuff with LLMs.
I actually think their product would benefit from more code that detects "stuff normal programs should be doing" and hands it off to those programs. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.
1 reply →