Comment by _KnighT_

3 months ago

I'm new to this topic. Can someone help me understand this sentence?

"Meanwhile, through the next-token prediction constraint, the explicit textual symbols of the hidden representations for Heima Encoder are aligned to the text of the corresponding special tokens {<CoT>(k)} in vocabulary, while the hidden representations contained in hidden states of thinking tokens remain distinct and variable depending on the inputs"

I understand that they have fine-tuned the MLLM so that, for each query-and-image input, it produces the CoT "thinking tokens" in addition to the answer.
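To make sure I'm reading this right, here is a toy sketch of my current mental model (all names here are made up by me, not from the paper): during fine-tuning, each plain-English CoT stage in the target sequence is replaced by a single special token <CoT>(k), so ordinary next-token prediction trains the model to emit those tokens between the question and the answer.

```python
def build_target(question_tokens, num_cot_stages, answer_tokens):
    """Assemble a supervision sequence: question tokens, one special
    <CoT>(k) token per CoT stage, then the answer tokens. Next-token
    prediction on this sequence is (as I understand it) what ties each
    <CoT>(k) position to its stage of the original chain of thought."""
    cot_tokens = [f"<CoT>({k})" for k in range(1, num_cot_stages + 1)]
    return question_tokens + cot_tokens + answer_tokens

target = build_target(["What", "is", "shown", "?"], 3, ["A", "cat"])
# target == ["What", "is", "shown", "?",
#            "<CoT>(1)", "<CoT>(2)", "<CoT>(3)", "A", "cat"]
# The emitted symbol at each CoT position is the same fixed special token,
# but the hidden state computed at that position still depends on the
# question/image prefix, which is how I read "distinct and variable
# depending on the inputs."
```

Is this roughly the right picture, or am I missing something about how the alignment works?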

How does that establish an association between the thinking tokens and the original plain-English CoT statements?

The second clause seems to say that the thinking tokens encode information that is "distinct and variable depending on the inputs." Is my interpretation correct?