Comment by aidenn0

1 month ago

I just tried to play, and the chatter didn't match the gameplay (e.g. "Good capture Yellow" when Yellow hadn't just captured). Yellow said they were going to capture, and had a legal capture available, but started a new pile instead.

[edit]

I won without a single one of my chips being killed. This was only because the moves they actually made didn't match the moves they announced (i.e. they missed several capture possibilities); the overwhelming majority (but not all) of plays were to start new piles.

[edit 2]

Looking over the logs, the chatter could imply that their internal state was out of sync with the game. E.g. "Yellow has 3 prisoners now" after Yellow started a new pile, when they could have taken 3 prisoners and had indeed stated that they were taking that pile.

6 comments

stavros 1 month ago

I think the game is bugged. I placed a green chip on another green chip and it didn't capture, and when I asked about it, the LLMs said the bottom chip was yellow, not green.

There seem to be some state management issues, which make this game fairly unplayable. Too bad, because it's an interesting idea.

aidenn0 1 month ago
Llama seems to make illegal moves, which confuse the game engine: it tries to play to nonexistent piles, which causes the chips to disappear (rather than end up in the Dead box). This then confuses the other AIs, which are counting chips in the Dead box and on the board.
Even if that were fixed, it wouldn't solve the problem that the AI makes really bad moves. I can win just by doing the following:
1. If there is a pile that I can capture with at least one chip not of my color, do it
2. Otherwise play on the largest pile
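
A rough sketch of that two-rule strategy, in case it helps anyone reproduce it. The pile representation (a list of color strings, bottom to top), the function names, and the `can_capture` predicate are all assumptions for illustration; the actual capture rule belongs to the game and is passed in rather than guessed at.

```python
from typing import Callable, Optional, Sequence

Pile = Sequence[str]  # chip colors, bottom to top (assumed representation)


def choose_move(
    piles: Sequence[Pile],
    my_color: str,
    can_capture: Callable[[Pile, str], bool],
) -> Optional[int]:
    """Return the index of the pile to play on, or None if there are no piles."""
    # Rule 1: take the first capturable pile that holds at least one chip
    # that is not my color.
    for i, pile in enumerate(piles):
        if can_capture(pile, my_color) and any(c != my_color for c in pile):
            return i
    # Rule 2: otherwise play on the largest pile (first one wins ties).
    if not piles:
        return None
    return max(range(len(piles)), key=lambda i: len(piles[i]))


def top_chip_matches(pile: Pile, color: str) -> bool:
    # Placeholder capture rule, NOT confirmed by the game: assume a pile is
    # capturable when its top chip is the player's color.
    return bool(pile) and pile[-1] == color
```

Calling `choose_move(piles, "green", top_chip_matches)` returns a pile index; swap in the real capture check for `top_chip_matches` once the rules are pinned down.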
- stavros 1 month ago
  
  Well, really bad moves are still better than illegal moves. I'm not sure why the engine allows itself to be confused by illegal moves, rather than just... disallowing them.
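
For what it's worth, "just disallowing them" could look something like the sketch below: validate the target pile before touching any state, so a bad move gets rejected instead of making a chip vanish. `Board`, `IllegalMove`, and `play` are hypothetical names, not the game's actual engine code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class IllegalMove(Exception):
    """Raised when a move targets a pile that does not exist."""


@dataclass
class Board:
    # Each pile is a list of chip colors, bottom to top (assumed layout).
    piles: List[List[str]] = field(default_factory=list)

    def play(self, color: str, pile_index: Optional[int] = None) -> None:
        """Place a chip; pile_index=None starts a new pile."""
        if pile_index is None:
            self.piles.append([color])
            return
        # Check the target first, so nothing is mutated on an illegal move.
        if not 0 <= pile_index < len(self.piles):
            raise IllegalMove(f"no pile at index {pile_index}")
        self.piles[pile_index].append(color)
```

The caller can catch `IllegalMove` and ask the model for another move, leaving the board exactly as it was.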
  

lout332 1 month ago

You're right about the state sync issues with some models. The lighter models (especially Llama) struggle with tracking game state. I've added more Gemini options, which handle this better. The research data used controlled AI-vs-AI runs where we could validate state consistency.
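
One simple consistency check that would catch the disappearing-chip problem described in this thread is a chip-conservation invariant: after every move, each color's chips across all zones should still add up to the starting count. This is only an illustrative sketch; the zone names and the `missing_chips` helper are assumptions, not the validation used for the research runs.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def missing_chips(
    starting: Mapping[str, int], *zones: Iterable[str]
) -> Dict[str, int]:
    """Return {color: starting_count - found_count} for every color that is off.

    Each zone is an iterable of chip colors (e.g. chips on the board, in the
    Dead box, held as prisoners, or still in hand). An empty dict means every
    chip is accounted for.
    """
    seen: Counter = Counter()
    for zone in zones:
        seen.update(zone)
    colors = set(starting) | set(seen)
    return {
        c: starting.get(c, 0) - seen[c]
        for c in colors
        if starting.get(c, 0) != seen[c]
    }
```

A positive value means chips of that color have gone missing; a negative value means extra chips appeared.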