Comment by jebarker

7 months ago

The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically they're using special tokens to mark in the prompt where the LLM should think more/revise the previous response. This can be repeated many times until some stop criteria occurs. S1 manually inserts these with heuristics, R1 learns the placement through RL I think.

3 comments