Comment by bubblyworld
1 year ago
This doesn't address its terrible performance. If it were touching anything like a real engine it would be playing at a superhuman level, not the level of an upper-tier beginner.
The way I read the article, it's just as terrible as you would expect it to be from pure word association, except for one version that's an outlier: not terrible at all within a well-defined search depth, and just as terrible as the rest beyond that. And only this outlier is the weird thing referenced in the headline.
I read this as saying that this outlier version is connecting to an engine, and that this engine happens to be parameterized for a not-particularly-deep search depth.

If it's an exercise in integration, they don't need to waste cycles on making the engine play awesomely - it's enough for validation if the integrated result is noticeably less bad than the LLM alone rambling away trying to sound like a chess expert.
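For concreteness, here's roughly what "calling an engine parameterized for a shallow depth" would amount to - a minimal sketch assuming python-chess and a local Stockfish binary, neither of which is mentioned anywhere in the article, so this is purely illustrative:

```python
# Hypothetical sketch only: python-chess driving a local Stockfish binary
# with a deliberately shallow, fixed search depth.
import chess
import chess.engine

def engine_move(fen: str, depth: int = 6) -> str:
    """Return the engine's move for a position, searched to a fixed depth."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    try:
        result = engine.play(board, chess.engine.Limit(depth=depth))
    finally:
        engine.quit()
    return result.move.uci()

# Example: ask for a move from the starting position with a shallow search.
print(engine_move(chess.STARTING_FEN, depth=6))
```

The depth cap is a single parameter, which is the point: making such an engine play merely "decent" instead of superhuman costs essentially nothing.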
In this hypothetical, the cycles aren't being wasted on the engine; they're being wasted on running a 200B-parameter LLM for longer than necessary in order to play chess badly instead of terribly. An engine playing superhuman chess takes a comparatively irrelevant amount of compute these days.
If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
What nobody has bothered to explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.
> If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
Yeah, that thought crossed my mind as well. I dismissed it on the assumption that the measurements in the blog post weren't done from openings but from later-stage game states, but I did not verify that assumption, so I might have been wrong.

As for the insignificance of engine cycles vs LLM cycles, sure. But if it's an integration experiment, they might buy the chess API from some external service with a big disconnect between price and cycle cost, or host one separately where they simply did not feel any need to bother with a scaling mechanism if they could make it good enough for detection by calling with low depth parameters.
And the last uncertainty (here I'm much further outside my knowledge): we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
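Purely to illustrate that worry, here's a hypothetical dispatch loop in which a single prompt can trigger several engine consultations. Every function below is a made-up stand-in - this is not a claim about how OpenAI's stack actually works:

```python
# Hypothetical sketch only: all names below are invented placeholders.
import random

def consult_engine(position: str, depth: int = 6) -> str:
    """Stand-in for a shallow-depth engine call (see the earlier sketch)."""
    return random.choice(["e2e4", "d2d4", "g1f3"])

def llm_refine(prompt: str, hint: str | None = None) -> str:
    """Stand-in for one round of 'inner dialogue' by the model."""
    return f"draft answer for {prompt!r}" + (f" using engine hint {hint}" if hint else "")

def answer(prompt: str, max_refinements: int = 4) -> str:
    engine_calls = 0
    draft = llm_refine(prompt)
    for _ in range(max_refinements):
        if "chess" not in prompt.lower():
            break                                   # router rejects the chess path
        draft = llm_refine(prompt, hint=consult_engine("startpos", depth=6))
        engine_calls += 1                           # one engine consultation per refinement
    return f"{draft} ({engine_calls} engine calls)"

print(answer("What is the best reply to 1.e4 in chess?"))
print(answer("Summarize this article for me."))
```

Depending on how many refinement rounds the router allows, the engine calls per prompt could multiply well beyond one, which is all the point above needs.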
That would have immediately given away that something must be off. If you want to do this in a subtle way that increases the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.
If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.
I agree with GP that if a 'fine tuning' of GPT-3.5 came out of the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.