Comment by Galanwe
7 hours ago
> You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum
Would you? I am not very knowledgeable about LLMs, but my understanding was that each query is essentially a stateless inference with the previous input/output as context. In that case, isn't a single puzzle that yields hundreds of queries essentially hundreds of path-dependent but individual tests?
From what I understood, it's a coding challenge: the models wrote a player for that specific word game. E.g. https://github.com/rayonnant-ai/aicc/blob/main/wordgempuzzle...
Generally speaking, would you accept a conclusion based only on an event that happened once?
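To put a rough number on the "hundreds of path-dependent but individual tests" question, here is a minimal sketch. It assumes the queries within one puzzle behave like a single cluster sharing a common pairwise correlation `rho`. That is purely an assumption for illustration; nobody in this thread measured it. Under that assumption the standard design-effect formula gives the effective number of independent tests. The function name and the `rho` values are hypothetical.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Effective number of independent observations in one cluster.

    n   -- number of correlated observations (e.g. queries on one puzzle)
    rho -- assumed common pairwise correlation between them (0 to 1)

    Uses the design-effect formula: n_eff = n / (1 + (n - 1) * rho).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return n / (1 + (n - 1) * rho)


if __name__ == "__main__":
    # Hypothetical correlation levels, chosen only to show the trend.
    for rho in (0.0, 0.1, 0.5, 1.0):
        print(f"rho={rho}: n_eff={effective_sample_size(100, rho):.2f}")
```

With `rho = 0` the 100 queries count as 100 tests; with `rho = 1` they collapse to a single test. Where a real puzzle run falls between those extremes is exactly what is in dispute here.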