Comment by simianwords
4 hours ago
> To some extent. It's not clear where specifically the boundaries are, but it seems to fail to approach problems in ways that aren't embedded in the training set. I certainly would not put money on it solving an arbitrary logical problem.
In what way can you falsify this without requiring the LLM to be omniscient? We have examples of it solving things that are not in its training set: it found vulnerabilities in 25-year-old BSD code that humans had missed. They were not trivial ones either.
Here's an odd testing example: I design very complex board and card games, and LLMs are terrible at figuring out whether the games make sense, or even at restating the rules in different wording.
I thought they would be ideal for the job, until I realized they would just pretend the rules worked because they looked like board game rules. The more you ask one to restate, manipulate, or simulate the rules, the more you can tell it's bluffing. It literally assumes every complicated set of rules works perfectly.
> it found vulnerabilities in 25-year-old BSD code that humans had missed.
I don't think the age of the code makes the problem more complex. Finding buffers that are too small is not rocket science; bothering to look at some corner of a codebase you've never paid attention to or seen a problem with is. AI being infinitely useful (cheap) to sic on the pieces of a codebase nobody ever looks at carefully is a great thing. It's not genius on the part of the AI.
> Here's an odd testing example: I design very complex board and card games, and LLMs are terrible at figuring out whether the games make sense, or even at restating the rules in different wording.
I'm positive that they are perfectly fine at this and will do a pretty good job. Did you actually try it?
Eh, I can see their point, I think. The models can restate the rules differently, I'm sure, but it sounds like the GP is saying that LLMs can't tell whether the rules are well-balanced.
It would be interesting to see some example problems along those lines. Design some games with complex rules, including one or two of the most subtle game-wrecking bugs you can think of, and ask the models if they can spot them.
In fact that sounds more interesting the more I think about it. Intensive RL on that sort of thing might generalize in... let's say useful ways.