Comment by eterm

13 hours ago

This makes me think LLMs would be interesting to set up in a game of Diplomacy, an entirely text-based game that softly, rather than strictly, requires a degree of backstabbing to win.

The finding in this game that the "thinking" model never did any thinking seems odd. Does the model not always show its thinking steps? It seems bizarre that it wouldn't reach for that tool even once when it must be bombarded with seemingly contradictory information from other players.

Reading more, I'm a little disappointed that the write-up seems to have leant so heavily on LLMs too, because it detracts from the credibility of the study itself.

  • Fair point. The core simulation and data collection were done programmatically - 162 games, raw logs, win rates. The analysis of gaslighting phrases and patterns was human-reviewed. I used LLMs to help with the landing page copy, which I should probably disclose more clearly. The underlying data and methodology are solid; you can check them here: https://github.com/lout33/so-long-sucker