Comment by MrCheeze
6 hours ago
The Claude Plays Pokemon stream with a minimal harness is a far more significant test of model intelligence compared to the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt which more or less railroads the AI into not making this mistakes it wants to make). The latter two harnesses have become too easy for the latest generations of model, enough so that they're not really testing anything anymore.
Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles which are both the last puzzles in the game and by far the most difficult for AIs to do. Opus 4.5 made it there but was completely hopeless, 4.6 made it there and is is showing some signs of maaaaaybe being eventually bruteforce through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.
No comments yet
Contribute on Hacker News ↗