Comment by HarHarVeryFunny

1 year ago

It's cheating to the extent that it misrepresents the strength and reasoning ability of the model, i.e. to the extent that anyone looking at its chess-playing results will incorrectly infer that they say something about how good the model is.

The takeaway here is that if you are evaluating different models for your own use case, the only reliable way to judge how useful each may be is to test it on your actual use case, and to ignore all benchmarks or anything else you may have heard about it.

It represents the model's reasoning ability in correctly choosing and using a tool... which seems more useful than a model that can play chess by itself but keeps playing chess when you need it to do something else.

  • Where it’ll surprise people is if they don’t realize it’s using an external tool and expect it to solve non-chess problems of similar complexity, or if they don’t realize this was probably a special case added to the program, and that it doesn’t mean the model has, like, learned how to go find and use the right tool for a given problem in the general case.

    I agree that this is a good way to enhance the utility of these things, though.

  • It doesn't take much to recognize a sequence of chess moves. A regex could do that (there's a rough sketch at the end of this thread).

    If what you want is intelligence and reasoning, there is no tool for that - LLMs are as good as it gets for now.

    At the end of the day it either works on your use case, or it doesn't. Perhaps it doesn't work out of the box but you can code an agent using tools and duct tape.

    • Do you really think it's feasible to maintain and execute a set of regexes for every known problem every time you need to reason about something? Welcome to the 1970s AI winter...

      3 replies →
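
A minimal sketch of the "a regex could do that" point above, in Python. The pattern, the looks_like_chess helper, and the 4-move threshold are illustrative assumptions, not anything from the thread, and it's nowhere near a full SAN parser:

    import re

    # Rough matcher for SAN (Standard Algebraic Notation) chess moves.
    # Illustrative only: ignores edge cases like "e.p." and annotations ("!?").
    SAN_MOVE = re.compile(
        r"\b(?:O-O(?:-O)?"                      # castling
        r"|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8]"    # piece/pawn move, optional capture
        r"(?:=[QRBN])?)"                        # optional pawn promotion
        r"[+#]?"                                # optional check/mate suffix
    )

    def looks_like_chess(text, threshold=4):
        # Heuristic: treat the text as a chess move sequence only if several
        # SAN-looking tokens appear, so a lone "a6" in prose doesn't trip it.
        return len(SAN_MOVE.findall(text)) >= threshold

    print(looks_like_chess("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"))   # True
    print(looks_like_chess("Please summarize this meeting"))    # False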