Comment by Moosdijk
2 days ago
Interesting. Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety, this approach seems to apply the same principle within one run, looping back internally.
Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to arrive at results with less knowledge but more wisdom.
Kind of like having a database of most of the possible frames in a video game and blending between them, instead of rendering the scene.
Isn’t this in a sense an RNN built out of a slice of an LLM? If so, it might have the same drawbacks, namely slowness to train, but also the same benefits, such as an endless context window (in theory).
It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as for a transformer with n distinct layers.
The notion of a context window applies to the sequence dimension, and the recurrence doesn't really affect that: each iteration sees and attends over the whole sequence.
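To make the shared-weights point concrete, here's a rough PyTorch sketch (the class and parameter names are mine, not from any of these papers): one transformer block applied n_steps times, with full attention over the sequence at every step.

    import torch
    import torch.nn as nn

    class RecurrentDepthEncoder(nn.Module):
        def __init__(self, d_model=512, n_heads=8, n_steps=12):
            super().__init__()
            # One set of layer weights, reused at every step.
            self.block = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.n_steps = n_steps

        def forward(self, x):
            # Each iteration attends over the whole sequence, so the
            # recurrence in depth doesn't shrink the context window.
            for _ in range(self.n_steps):
                x = self.block(x)
            return x

    x = torch.randn(2, 128, 512)             # (batch, seq, d_model)
    print(RecurrentDepthEncoder()(x).shape)  # torch.Size([2, 128, 512])

Compute for n_steps iterations matches an n-layer transformer; only the parameter count is that of a single layer.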
Thanks, this was helpful! Reading the seminal paper[0] on Universal Transformers also gave some insights:
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting; it seems to be an “old” architecture that is only now being leveraged to a promising extent. Curious what made it an active area again (with the work from Samsung and Sapient, and now this one). Perhaps diminishing returns on regular transformers?
0: https://arxiv.org/abs/1807.03819
> Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety
I'm not sure what you mean here, but there isn't a difference in the number of times a model runs during inference.
I meant going straight to the likeliest output (flash) versus iteratively generating multiple outputs and choosing the best one (thinking/pro).
That's not how these models work.
Thinking models produce thinking tokens to reason out the answer.
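In case it helps, here's a toy decode loop (a hypothetical stand-in model, not any real API) showing why there's no difference in how many times the model runs: both kinds generate one token per forward pass, and a thinking model just emits reasoning tokens before the visible answer.

    import torch

    class ToyLM(torch.nn.Module):
        # Stand-in language model: token ids in, next-token logits out.
        def __init__(self, vocab=100, d=32):
            super().__init__()
            self.emb = torch.nn.Embedding(vocab, d)
            self.head = torch.nn.Linear(d, vocab)
        def forward(self, ids):
            return self.head(self.emb(torch.tensor(ids)).mean(dim=0))

    def decode(model, prompt_ids, max_new=8):
        ids = list(prompt_ids)
        for _ in range(max_new):
            # One forward pass per generated token, whether those
            # tokens are "thinking" tokens or the final answer.
            ids.append(int(model(ids).argmax()))
        return ids

    print(decode(ToyLM(), [1, 2, 3]))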