Comment by antirez

3 months ago

Cool, but isn't this encoding a potentially very long thinking process into a fixed-size embedding? Intuitively, that should not work as well.

But don't words have a fixed-size embedding too? Causal models produce a sequence of attended word embeddings. At only about 300 dimensions per word, it seems counterintuitive that you could compress an entire reasoning chain into such a small vector.

That's already the case with visible text: there's a fixed-size embedding inside the model as it spits out each next token.

  • Sure, but you have multiple thought tokens in the context that the model sees when it processes the next token.
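
To make the capacity point in this exchange concrete, here is a minimal NumPy sketch. All the numbers are toy assumptions (the 300-dimension figure echoes the comment above, and `n_thoughts = 50` is arbitrary), and the attention step is bare single-head dot-product attention, not the architecture from any particular paper. It only illustrates how many stored values each approach can draw on:

```python
import numpy as np

d = 300          # per-token embedding size (hypothetical, from the comment above)
n_thoughts = 50  # length of a visible chain of thought (arbitrary toy value)

rng = np.random.default_rng(0)

# (a) Latent reasoning: the whole chain is squeezed into ONE d-dim vector.
latent_thought = rng.standard_normal(d)                # d values total

# (b) Visible reasoning: the model keeps one hidden state PER thought token
# and attends over all of them when predicting the next token.
thought_states = rng.standard_normal((n_thoughts, d))  # n_thoughts * d values

# Single-head dot-product attention over the visible thought states.
query = rng.standard_normal(d)                 # current position's query vector
scores = thought_states @ query / np.sqrt(d)   # one score per thought token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # softmax over thought tokens
context = weights @ thought_states             # d-dim mix of n_thoughts states

print(f"capacity (a): {d} values")                # 300
print(f"capacity (b): {n_thoughts * d} values")   # 15,000
```

The output vector is the same size in both cases, but in (b) attention selects it afresh at every step from 15,000 stored values, whereas in (a) every future step must work from the same 300, which is the intuition behind the objection in the thread.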