Comment by gs17
1 year ago
> For the OpenAI models he generated up to 10 different outputs until he got one that was legal, or just randomly chose a move if it failed.
I wonder how often they failed to generate a move. That feels like it could be a meaningful difference.
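For what it's worth, the retry scheme in the quote is easy to instrument so that failures get counted rather than hidden. Here is a rough sketch, not the author's actual code: `pick_move`, `generate`, and `legal_moves` are hypothetical names, and `generate` stands in for whatever call returns one model output as a move string.

```python
import random
from typing import Callable, Iterable, Optional, Tuple


def pick_move(
    generate: Callable[[], str],      # hypothetical: returns one model output
    legal_moves: Iterable[str],       # legal moves in the same notation
    max_attempts: int = 10,           # "up to 10 different outputs"
    rng: Optional[random.Random] = None,
) -> Tuple[str, int, bool]:
    """Return (move, attempts_used, fell_back_to_random)."""
    legal = list(legal_moves)
    if not legal:
        raise ValueError("no legal moves available")
    rng = rng or random.Random()

    # Sample until one output is a legal move.
    for attempt in range(1, max_attempts + 1):
        candidate = generate().strip()
        if candidate in legal:
            return candidate, attempt, False

    # Every attempt was illegal: fall back to a uniformly random legal move.
    return rng.choice(legal), max_attempts, True
```

Logging the second and third return values per move would show how often a model needed retries and how often it fell through to a random move, which is the number I'm curious about.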
gpt-3.5-turbo-instruct had something like 5 (or fewer) illegal moves in 8205, so at most about 0.06%:
https://github.com/adamkarvonen/chess_gpt_eval
I expect the rest to be much worse, if gpt-4's performance is any indication.
And the most notable part of that:
> Most of gpt-4's losses were due to illegal moves
3.5-turbo-instruct definitely has some better chess skills.