Comment by HarHarVeryFunny

3 months ago

It's different because a chat model has been post-trained for chat, while o1/o3 have been post-trained for reasoning.

Imagine trying to have a conversation with someone who's been told to interpret anything said to them as a problem they need to reason about and solve. I doubt you'd give them high marks for conversational skill.

Ideally one model could do it all, but for now the tech is apparently being trained with reinforcement learning to steer responses toward a single training goal (gaming human feedback, or successful reasoning).

TFA, and my response, are about a de novo relationship between task completion and input prompt. Not conversational skill.

  • Yes, and the "de novo" explanation appears obvious, as noted: the model was trained differently, with different reinforcement learning goals (reasoning vs. human feedback for chat). The need for different prompting follows from the different operational behavior of a model trained this way: self-evaluation against the data present in the prompt, backtracking when it veers away from the goals established in the prompt, and so on (the handful of reasoning behaviors that have been baked into the model via RL).