Comment by HarHarVeryFunny

12 hours ago

> A more interesting question is, how would you do at a math competition if you were taught to read, then left alone in your room with a bunch of math books?

But that isn't how an LLM learnt to solve math olympiad problems. This isn't a base model just trained on a bunch of math books.

The way they get LLMs to be good at specialized things like math olympiad problems is to custom-train them for it using reinforcement learning: they give the LLM lots of examples of similar math problems being solved, showing all the individual solution steps, train on these, and reward the model when (having selected an appropriate sequence of solution steps) it is able to correctly solve the problem itself.
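Very roughly, the training loop looks something like the sketch below. This is only an illustration - `model.generate` and `model.update` are placeholders for the policy being trained and its update rule, not any lab's actual code.

```python
# Rough sketch of RL fine-tuning on math problems with checkable answers.
# `model` is a placeholder for the policy being trained, not a real API.

def rl_finetune(model, problems, known_answers, steps=1000):
    for _ in range(steps):
        for problem, known in zip(problems, known_answers):
            # The model writes out a full step-by-step solution.
            response = model.generate(problem)
            # The reward depends only on whether the final answer is right.
            final_answer = response.split("Answer:")[-1].strip()
            reward = 1.0 if final_answer == known else 0.0
            # The update nudges the model toward solutions that got rewarded.
            model.update(problem, response, reward)
```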

So it's not a matter of the LLM reading a bunch of math books and then being expert at math reasoning and problem solving; it's more along the lines of "monkey see, monkey do". The LLM was explicitly shown how to solve these problems step by step, then trained extensively until it got it and was able to do it itself. It's probably a reflection of the self-contained and logical nature of math that this works - that the LLM can be trained on one group of problems and the generalizations it has learnt work on unseen problems.

The dream is to be able to teach LLMs to reason more generally, but the reasons this works for math don't generally apply, so it's not clear that this math success can be used to predict future LLM advances in general reasoning.

> The dream is to be able to teach LLMs to reason more generally, but the reasons this works for math don't generally apply

Why is that? Any suggestions for further reading that justifies this point?

Ultimately, reinforcement learning is still just a matter of shoveling in more text. Would RL work on humans? Why or why not? How similar is it to what kids are exposed to in school?

  • An important difference between reinforcement learning (RL) and pre-training is the error feedback that is given. For pre-training the error feedback is just the next-token prediction error. For RL you need to have a goal in mind (e.g. successfully solving math problems), and the training feedback is the RL "reward" - a measure of how well the model's output achieved the goal.
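
    To make the difference concrete, here is a toy numerical sketch (the probabilities are made up, nothing model-specific): pre-training gets an error signal at every token position, while RL gets a single number for the whole response.

    ```python
    import math

    # Pre-training feedback: a cross-entropy error at every token position.
    # p_correct[i] is the (made-up) probability the model assigned to the
    # token that actually came next at position i.
    p_correct = [0.9, 0.4, 0.7, 0.1]
    per_token_errors = [-math.log(p) for p in p_correct]
    print(per_token_errors)   # one error signal per position

    # RL feedback: one scalar reward for the entire response, e.g. 1.0 if
    # the final answer was right and 0.0 if it was wrong.
    reward = 1.0
    print(reward)             # one number for the whole episode
    ```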

    With RL used for LLMs, it's the whole LLM response that is being judged and rewarded (not just the next word), so you might give it a math problem and ask it to solve it, then when it has finished take the generated answer and check whether it is correct, and this reward feedback is what allows the RL algorithm to learn to do better.
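
    In the simplest policy-gradient view (a REINFORCE-style sketch, not the exact algorithm any particular lab uses), that single pass/fail reward just scales how strongly the whole sampled response is reinforced:

    ```python
    import math

    # Made-up probabilities the model assigned to the tokens it actually
    # sampled while writing out its solution.
    sampled_token_probs = [0.8, 0.6, 0.9, 0.5]
    log_prob_of_response = sum(math.log(p) for p in sampled_token_probs)

    # The checker says whether the final answer was correct.
    reward = 1.0   # 0.0 if the answer was wrong

    # REINFORCE-style objective: responses that earned reward have their
    # log-probability pushed up; unrewarded ones contribute nothing here.
    # The gradient of this quantity with respect to the model's parameters
    # is the learning signal.
    objective = reward * log_prob_of_response
    print(objective)
    ```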

    There are at least two problems with trying to use RL as a way to improve LLM reasoning in the general case.

    1) Unlike math (and programming), it is not easy to automatically check the solution to most general reasoning problems. With a math problem asking for a numerical answer, you can just check against the known answer, and for a programming task you can just check that the program compiles and its output is correct. In contrast, how do you check the answer to a more general problem such as "Should NATO expand to include Ukraine?"?! If you can't define a reward then you can't use RL. People have tried using "LLM as judge" to provide rewards in cases like this (give the LLM response to another LLM and ask it whether the goal was met), but apparently this does not work very well. (The sketch after this list contrasts these reward types.)

    2) Even if you could provide rewards for more general reasoning problems, and therefore were able to use RL to train the LLM to generate good solutions for those training examples, this is not very useful unless the reasoning it has learnt generalizes to other problems it was not trained on. In narrow logical domains like math and programming this evidently works very well, but it is far from clear how learning to reason about NATO will help with reasoning about cooking or cutting your cat's nails, and the general solution to reasoning can't be "we'll just train it on every possible question anyone might ever ask"!
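
    Point 1) in code form - these are toy checkers, not any real system, but they show why math and programming are easy to reward while open-ended questions are not:

    ```python
    def math_reward(response: str, known_answer: str) -> float:
        # Verifiable: compare the model's final answer to the known one.
        final = response.split("Answer:")[-1].strip()
        return 1.0 if final == known_answer else 0.0

    def code_reward(tests_passed: int, tests_total: int) -> float:
        # Verifiable: run the generated program against a test suite.
        return tests_passed / tests_total

    def general_reasoning_reward(response: str) -> float:
        # Not verifiable: there is no ground truth to check
        # "Should NATO expand to include Ukraine?" against. "LLM as judge"
        # would go here, and in practice it is a weak signal.
        raise NotImplementedError("no reliable automatic reward exists")
    ```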

    I don't have any particular reading suggestions, but these are widely accepted limiting factors to using RL for LLM reasoning.

    I don't think RL for humans would work too well, and it's not generally the way we learn, or the way kids are mostly taught in school. We mostly learn, or are taught, individual skills and when they can be used, then practice combining and applying them. The closest thing to RL in school would be if the only feedback an English teacher gave you on your writing assignments was a letter grade, with no commentary, and you had to figure out for yourself what you needed to improve!