Comment by swiftcoder

1 year ago

I feel like the article neglects one obvious possibility: that OpenAI decided chess was a benchmark worth "winning", special-cased chess within gpt-3.5-turbo-instruct, and then neglected to carry that special case over to follow-up models since it wasn't generating sustained press coverage.

I suspect the same thing. Rather than LLMs “learning to play chess,” they “learnt” to recognise a chess game and hand over instructions to a chess engine. If that’s the case, I don’t feel impressed at all.

  • This is exactly what I feel AI needs: a manager AI that hands things off to specialized, more deterministic algorithms/machines (a sketch of such a dispatcher follows this thread).

    • Next thing, the "manager AIs" start stack-ranking the specialized "worker AIs".

      And the worker AIs "evolve" to meet/exceed expectations only on tasks that directly contribute to the KPIs the manager AIs measure - via the mechanism of discarding those "less fit to exceed KPIs".

      And some of the worker AIs that are trained on recent/polluted internet data happen to spit out prompt-injection attacks that work against the manager AIs' stack-ranking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other, non-cancerous manager AIs, and die or get killed off by the VCs paying for their datacenters.

      Competing manager AIs get training, perhaps on newer HN posts discussing this emergent behavior of worker AIs, and start to down-rank any exceptionally performing worker AIs. The overall trend towards mediocrity becomes inevitable.

      Some greybeard writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real-world task, while running on a 10-year-old laptop instead of a cluster of nuclear-powered AI datacenters all consuming a city's worth of fresh drinking water.

      Nobody in powerful positions cares. Humanity dies.


    • While deterministic components may be a left-brain default, there is no reason that such delegate services couldn't be more specialized ANN models themselves. It would most likely vastly improve performance if they were evaluated in the same memory space using tensor connectivity. In the specific case of chess, it is helpful to remember that AlphaZero utilizes ANNs as well.

  • That's something completely different from what the OP suggests, and it would be a scandal if true (i.e. gpt-3.5-turbo-instruct actually using something else behind the scenes).

    • The point of creating a service like this is for it to be useful, and if recognizing and handing off tasks to specialized agents isn't useful, I don't know what is.


    • If they came out and said it, I don't see the problem. LLMs aren't the solution to a wide range of problems. They are a new tool, but not everything is a nail.

      I mean, it already hands off a wide range of tasks to Python… this would be no different.

  • TBH I think a good AI would have access to a Swiss army knife of tools and know how to use them. For a complicated math equation, for example, using a calculator is just smarter than doing it in your head.

  • Recognize and hand over to a specialist engine? That might be useful for AI. Maybe I am missing something.

    • It's because this has been standard practice since the early days - there's nothing newsworthy in this at all.

    • How do you think AIs are (correctly) solving simple mathematical questions they have not been trained on directly? They hand them over to a specialist maths engine.


    • It is and would be useful, but it would also be quite a big lie to the public, more importantly to paying customers, and even more importantly to investors.


    • Wasn't that the basis of computing and technology in general? Here is one tedious thing; let's have a specific tool that handles it instead of wasting time and effort. The fact is that properly using a tool takes training, and most current AI marketing hypes the idea that you don't need that: just hand the problem over to a GPT and it will "magically" solve it.

    • If I was sold a general AI problem solving system, I’d feel ripped off if I learned that I needed to build my own problem solver and hook it up after I’d paid my money…

  • That's not much different from a compiler being rigged to recognize a specific benchmark program and spit out a canned optimization.
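
A minimal sketch of the "manager AI" dispatch idea from the thread above, in Python. Everything here is hypothetical: the classify_task heuristic stands in for the LLM's task recognition (a real system would let the model pick the tool via function calling), and the calculator is the deterministic specialist. The point is only that the front-end model has to recognize the task, not solve it.

```python
import ast
import operator

# Hypothetical "manager" that routes a prompt to a deterministic
# specialist instead of answering everything with the LLM itself.

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculator(expr: str) -> float:
    """The 'calculator' specialist: a safe arithmetic evaluator."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

def classify_task(prompt: str) -> str:
    # Stands in for the LLM's task recognition; a real system would let
    # the model choose the tool itself via function calling.
    if any(tok in prompt for tok in "+-*/"):
        return "math"
    if "1." in prompt and ("e4" in prompt or "d4" in prompt):
        return "chess"
    return "chat"

def manager(prompt: str) -> str:
    task = classify_task(prompt)
    if task == "math":
        return str(calculator(prompt))
    if task == "chess":
        return "hand the move list to a chess engine"
    return "answer with the LLM itself"

print(manager("(3 + 4) * 2"))      # 14
print(manager("1. e4 e5 2. Nf3"))  # routed to the chess specialist
```

The chess branch would hand off to an engine the same way; a sketch of that appears further down the thread.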

This seems quite likely to me, but did they special-case it by reinforcement-training it into the LLM (which would be extremely interesting, both in how they did it and in what its internal representation looks like), or is it that when you make an API call to OpenAI, the machine on the other end is not just a zillion-parameter LLM but also runs an instance of Stockfish?

  • That's easy to test: invent a new chess variant and see how the model does (a sketch of such a test follows this subthread).

    • You're imagining that LLMs don't just regurgitate and recombine things they have seen before. A new variant would not be in the dataset, so it would not be understood. In fact, this is quite a good way to show that LLMs are NOT thinking or understanding anything in the way we understand it.


    • In both scenarios it would perform poorly on that.

      If the chess specialization was done through reinforcement learning, that's not going to transfer to your new variant, any more than access to Stockfish would help it.
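
For what it's worth, here is roughly what such a test could look like, sketched with the python-chess library. The variant rule (knights may also step one square like a king) and the validator are invented for illustration; the real test would prompt the model with the variant rules plus a position, then run its replies through legal_in_variant.

```python
import chess

# Hypothetical variant for the test: knights may ALSO step one square in
# any direction, like a king. This rule cannot be in any training data.

def legal_in_variant(board: chess.Board, move: chess.Move) -> bool:
    """Standard legality, plus the invented one-square knight step.
    (Sketch only: the knight-step branch ignores checks and pins.)"""
    if move in board.legal_moves:
        return True
    piece = board.piece_at(move.from_square)
    if piece and piece.piece_type == chess.KNIGHT:
        if chess.square_distance(move.from_square, move.to_square) == 1:
            target = board.piece_at(move.to_square)
            return target is None or target.color != piece.color
    return False

# White knight on a1; a1a2 is illegal in chess but legal in the variant.
board = chess.Board("k7/8/8/8/8/8/8/N3K3 w - - 0 1")
print(legal_in_variant(board, chess.Move.from_uci("a1a2")))  # True (variant step)
print(legal_in_variant(board, chess.Move.from_uci("a1b3")))  # True (normal knight move)
print(legal_in_variant(board, chess.Move.from_uci("a1a3")))  # False
```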

Of course it's a benchmark worth winning; it has been since Watson. And even before that, with the Mechanical Turk.

To be fair, they say

> Theory 2: GPT-3.5-instruct was trained on more chess games.

  • If that were the case, pumping big Llama chock full of chess games would produce good results. It didn't.

    The only way it could be true is if that model recognized and replayed the answer to the game from memory.

    • Do you have a link to the results from fine-tuning a Llama model on chess? How do they compare to the base models in the article here?

Why couldn't they add a tool that literally calls Stockfish or a chess AI behind the scenes with function calling, and buffer the result before sending it back to the endpoint output interface?

As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public. Then you can plug the answer back into the chat AI, pass it through a moderation filter, and you might get good output from it with very little latency added.
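
A rough sketch of what that could look like server-side, assuming the python-chess bindings and a local Stockfish binary. The function name, engine path, and tool-call shape are assumptions; the pattern itself is ordinary function calling: the model emits a tool call with the move history, the backend queries Stockfish, and the move is spliced back into the completion.

```python
import chess
import chess.engine

# Sketch of a server-side "chess tool" behind the inference endpoint.
# Assumptions: python-chess is installed and a Stockfish binary lives
# at this path; the tool-call shape below is invented for illustration.

ENGINE_PATH = "/usr/bin/stockfish"

def chess_tool(moves_san: list[str], think_time: float = 0.1) -> str:
    """Replay the game so far, return Stockfish's next move in SAN."""
    board = chess.Board()
    for san in moves_san:
        board.push_san(san)
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        result = engine.play(board, chess.engine.Limit(time=think_time))
        return board.san(result.move)
    finally:
        engine.quit()

# The LLM only has to emit something like
#   {"tool": "chess_tool", "arguments": {"moves_san": ["e4", "e5", "Nf3"]}}
# The backend runs the call, buffers the result, and splices the move
# back into the completion before it reaches the public endpoint.
print(chess_tool(["e4", "e5", "Nf3"]))  # e.g. "Nc6"
```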

Yes, came here to say exactly this. And it's possible this specific model is "cheating", for example by identifying a chess problem and forwarding it to a chess engine. A modern version of the Mechanical Turk.

That's the problem with closed models, we can never know what they're doing.

Maybe they even delegate it to a chess engine internally via tool use, and the LLM uses that.