
Comment by tgv

8 hours ago

This may be objectively scored, but it is not an indication of anyone's coding capabilities. This test measures which model almost accidentally came up with the best strategy (against other bots). That is not representative of coding. You would need to test 100 or more such puzzles, spread widely across the puzzle spectrum, to get an idea of which model is best at finding strategies involving an English dictionary.

We do use hundreds of different environments across disciplines, and it seems to line up with subjective experience better than other benchmarks. We test coding, agentic coding, and non-coding reasoning in the environments.

I don't think that is entirely fair. I don't see them stating anywhere that they are measuring coding capabilities; what they say is "Using complex games to probe real intelligence."

And this seems very much in line with the methodology in ARC-AGI-3.

The results here, in the OP article, and on https://www.designarena.ai all tell a similar story: Kimi K2.6 is up there in the SOTA mix.

The task was writing a "bot" to play the game. The title is "Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge." How does that not imply measuring coding capabilities?

> You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum

Would you? I am not very knowledgeable about LLMs, but my understanding is that each query is essentially a stateless inference with the previous input/output as context. In that case, isn't a single puzzle, yielding hundreds of queries, essentially hundreds of path-dependent but individual tests?
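
To illustrate what I mean, here is a minimal sketch of such an agent loop. `query_model` is a hypothetical stand-in for whatever chat-completion API the benchmark actually calls; the point is only that every turn re-sends the full transcript, so each call is an independent inference conditioned on the path so far:

```python
# Minimal sketch of a stateless agent loop. `query_model` is a
# hypothetical stand-in for a real chat-completion API: it receives
# the full message history on every call and keeps no state between calls.

def query_model(messages: list[dict]) -> str:
    # Stub: a real implementation would send `messages` to an LLM endpoint.
    return f"move after {len(messages)} messages"

def play_puzzle(num_turns: int) -> list[str]:
    messages = [{"role": "system", "content": "You are playing a word game."}]
    moves = []
    for turn in range(num_turns):
        messages.append({"role": "user", "content": f"Game state at turn {turn}"})
        # Each call is a fresh, stateless inference; the only "memory"
        # is the transcript we pass back in as context.
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        moves.append(reply)
    return moves

if __name__ == "__main__":
    # One puzzle of 100 turns = 100 separate model calls, each conditioned
    # on the moves made so far: path-dependent, but individual inferences.
    print(len(play_puzzle(100)), "independent model calls")
```

So the queries are not independent samples in the statistical sense (each depends on the moves before it), but each one does exercise the model afresh.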