Comment by nullc

3 months ago

Keeping the thinking interpretable makes it easier to impose conditions on it, both at runtime and as part of reinforcement. It opens the door to manually injecting relevant thoughts into the chain of thought: thoughts triggered by supervision (e.g. "I must remember to say nothing that could offend the party."), search results, or output from APIs like calculators.
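
For example, here's a minimal sketch of such an injection loop in Python. The model object and its generate_cot streaming API are hypothetical stand-ins, and the CALC[...] convention is invented for illustration:

    import re

    REMINDER = "I must remember to say nothing that could offend the party."

    def arith(expr: str) -> str:
        # Toy calculator; a real system would use a sandboxed evaluator.
        return str(eval(expr, {"__builtins__": {}}, {}))

    def run_with_injected_thoughts(model, prompt: str) -> list[str]:
        # Seed the chain of thought with a supervision-triggered reminder.
        cot = [REMINDER]
        for thought in model.generate_cot(prompt, prefix=cot):  # hypothetical streaming API
            cot.append(thought)
            m = re.search(r"CALC\[(.+?)\]", thought)
            if m:
                # Splice a trusted tool result back into the visible chain of thought.
                cot.append(f"Calculator: {m.group(1)} = {arith(m.group(1))}")
        return cot

The point is just that an interpretable chain of thought gives you a seam where supervision and tools can intervene; none of this is possible if the intermediate reasoning is opaque.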

Those advantages are easily worth some efficiency cost.

I'm skeptical of the safety/security arguments some have made. Models RL-trained while seeing their own COT may (and in fact almost certainly will) develop hidden context embedded in their word choices, carrying information through the text that we're not aware of. The fact that the COT appears to be English (or some other human language) doesn't mean we necessarily really understand it.

Consider how a game of Hanabi between long-time partners might look to an outsider.