Comment by code_biologist

4 days ago

> Whether it's actually correct, whether it works, I don't think that's even a concept in these systems.

I'm not an expert, but it is a concept in these systems. Check out some videos on DeepSeek's R1 paper. In particular, they did a lot during reinforcement learning to steer the chain-of-thought reasoning process toward correct answers in "coding, mathematics, science, and logic reasoning." I presume basically all the state-of-the-art CoT reasoning models have a similar "correct and useful reasoning" component in their RL tuning. This explains why models are getting better at math and code, but not as much at creative writing.

As I understand it, everybody is pretty data-limited, but it's much easier to generate synthetic training data for problems with a verifiably right answer than it is to generate good synthetic creative writing. It's also much easier to check during training that the model answered those problems correctly than it is to wait for human feedback via RLHF.
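To make "checking correctness during training" concrete, here's a minimal sketch of what an R1-style verifiable reward can look like: instead of a learned human-preference model, a cheap rule-based check scores each sampled completion. The function names, the \boxed{} answer extraction, and the exact reward values are my assumptions for illustration, not DeepSeek's actual code.

```python
# Hypothetical sketch of a rule-based "verifiable reward" for RL on math problems.
# Names, extraction rules, and reward magnitudes are illustrative assumptions.
import re

def extract_final_answer(completion: str) -> str | None:
    # Reasoning models are often prompted to wrap the final answer, e.g. in
    # \boxed{...}; grab the last such answer if one is present.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Reward 1.0 only if the extracted answer matches the known-correct answer.
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    # Small bonus for keeping the chain of thought inside <think>...</think> tags;
    # the R1 paper rewards a structured format alongside accuracy.
    return 0.2 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

# During RL, many completions are sampled per prompt and scored with automatic
# checks like these -- no human labeler in the loop.
sample = "<think>2+2 is 4 because ...</think> The answer is \\boxed{4}."
print(accuracy_reward(sample, "4") + format_reward(sample))  # 1.2
```

The point is that the grader is just code: for math you compare answers, for programming you run test cases, which is why this scales so much more cheaply than collecting human preference labels.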

It seems that OpenAI forgot to make sure their critic model punished o3 for being wrong when it claimed it had a laptop, lol.