Comment by jebarker
3 months ago
The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically, they use special tokens to mark where in the prompt the LLM should think more / revise its previous response. This can be repeated many times until some stop criterion is met. S1 inserts these manually with heuristics; R1, I think, learns the placement through RL.
? They're not really special tokens, though.
I'm not actually sure whether they're special tokens in the sense of being in the vocabulary.
<think> might be, but I think "wait" is tokenized like any other word from pretraining.
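The distinction being debated can be sketched concretely. This is a toy illustration (not the real R1/S1 tokenizers): a "special" token is an atomic vocabulary entry the tokenizer never splits, while an ordinary word like "wait" is broken up by whatever subword merges were learned in pretraining. The merge list here is made up for the example.

```python
import re

# Hypothetical special tokens, treated as atomic (never split).
SPECIAL_TOKENS = {"<think>", "</think>"}

def toy_tokenize(text, merges=("wa", "it")):
    """Crude stand-in for BPE: special tokens stay whole; ordinary
    words are emitted as known merges, falling back to characters."""
    # Split special tokens out first so they remain atomic.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    pieces = [p for p in re.split(pattern, text) if p]
    tokens = []
    for piece in pieces:
        if piece in SPECIAL_TOKENS:
            tokens.append(piece)  # never split
            continue
        for word in piece.split():
            i = 0
            while i < len(word):
                for m in merges:
                    if word.startswith(m, i):
                        tokens.append(m)
                        i += len(m)
                        break
                else:
                    tokens.append(word[i])  # unknown span -> characters
                    i += 1
    return tokens

print(toy_tokenize("<think> wait"))  # ['<think>', 'wa', 'it']
```

With a real Hugging Face tokenizer you could check the same thing directly, e.g. whether "<think>" appears in `tokenizer.get_vocab()` or in `tokenizer.all_special_tokens`, versus how `tokenizer.tokenize("wait")` splits.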