Comment by simianwords

6 hours ago

> Here's an odd example of testing, but I design very complex board and card games, and LLMs are terrible at figuring out whether they make sense or really even restating the rules in a different wording.

I'm positive that they are perfectly fine and will do a pretty good job. Did you actually try it?

Eh, I can see their point, I think. The models can surely restate the rules differently, but it sounds like the GP is saying that LLMs can't tell whether the rules are well-balanced.

It would be interesting to see some example problems along those lines. Design some games with complex rules, including one or two of the most subtle game-wrecking bugs you can think of, and ask the models if they can spot them.

In fact that sounds more interesting the more I think about it. Intensive RL on that sort of thing might generalize in... let's say useful ways.