Comment by famouswaffles

2 years ago

>The training and loss function of the foundation models are completely oriented towards next-token accuracy.

This doesn't mean anything.

The loss function and training only concern themselves with the result of the prediction. The in-between, the computation, is something training does not care about except as a means to an end.

It's not Input A > Output B. It's Input A > Computation > Output B.

That Computation could quite literally be anything. And no, we do not automatically know what this computation might represent or if it even represents anything that would be understandable by us.

If you train a Transformer meticulously on predicting the token that is the result of an addition, you might hope the computation it learns is some algorithm for addition, but you wouldn't actually know until you successfully probed the model.
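
As a toy sketch of what "probing" means here (my own example, with a small MLP standing in for the Transformer and the carry bit as a hypothetical candidate feature, not anything from the thread):

```python
# Train a small net to predict a + b from digit pairs, then test whether
# a candidate intermediate quantity (the carry, i.e. a + b >= 10) is
# linearly decodable from its hidden activations.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(4000, 2)).astype(float)  # digit pairs
y = X.sum(axis=1)                                      # target: the sum

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                   random_state=0).fit(X, y)

# Recompute the hidden layer by hand (relu is MLPRegressor's default).
H = np.maximum(X @ net.coefs_[0] + net.intercepts_[0], 0)

# Fit a linear probe for the carry bit; score it on held-out pairs.
carry = (y >= 10).astype(int)
probe = LogisticRegression(max_iter=1000).fit(H[:3000], carry[:3000])
print("carry-bit probe accuracy:", probe.score(H[3000:], carry[3000:]))
```

A score near chance suggests the net computes the sum some other way; a high score is evidence (not proof) that a carry-like feature exists in the hidden state.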

But in an LLM it is not an arbitrary computation. Very specifically, it is a single forward pass through a neural network.

Neural networks are very general function approximators so yes, there is some room for emergent behavior. But it could _not_ be "quite literally anything." It's plugging in values for a single (very big) equation.
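
To make that concrete, a minimal sketch (shapes invented for illustration): a two-layer MLP's forward pass is literally one expression, and a transformer forward pass is the same idea at much larger scale.

```python
# A forward pass as one (very big) equation: two-layer MLP in plain numpy.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(512, 64)), np.zeros(512)   # layer 1 weights
W2, b2 = rng.normal(size=(10, 512)), np.zeros(10)    # layer 2 weights

x = rng.normal(size=64)                    # input vector
y = W2 @ np.maximum(W1 @ x + b1, 0) + b2   # the entire forward pass
print(y.shape)                             # -> (10,)
```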

I think we do ourselves a disservice by pretending it's more of a black box than it is.

  • How many passes it takes is irrelevant. You can perform any computation you like in a single pass if you have enough compute time.

    Trained transformers have limited compute time per token, so each query is compute-limited, but this is trivially increased: by increasing tokens, or by increasing dimensions in the next training round so that each token permits more compute time.

    A forward pass is not one big equation, and I have no clue why you think it is. It's a series of computations, computations that depend on the query awaiting prediction. It's not even the same series of computations for each query, because not all neurons get activated, period, and even when the same neurons do get activated, they are not necessarily activated in the same way.
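
    A minimal sketch of that last point (my example, plain numpy, nothing transformer-specific): with ReLU gating, different inputs switch different neurons on, so the effective computation is input-dependent.

    ```python
    # Which neurons fire depends on the input: the ReLU mask changes.
    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(8, 4))            # one layer, 8 neurons

    x1, x2 = rng.normal(size=4), rng.normal(size=4)
    print((W @ x1 > 0).astype(int))        # activation pattern for x1
    print((W @ x2 > 0).astype(int))        # typically a different pattern
    ```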

    • > You can perform any computation you like in a single pass if you have enough compute time.

      You can't perform _any_ computation. A single forward pass through a neural network can perform many classes of computation, and it can _approximate_ all... but that's not a guarantee that the approximation will be good (and there are classes for which the approximation is pretty much guaranteed to be bad).
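
      A quick sketch of that caveat (my example; the targets and budget are arbitrary): with a fixed parameter budget, a smooth low-frequency function fits well while a highly oscillatory one fits badly.

      ```python
      # Same small ReLU net, two targets: R^2 collapses on the oscillatory
      # one, because 16 units can only carve out ~17 linear pieces in 1D.
      import numpy as np
      from sklearn.neural_network import MLPRegressor

      rng = np.random.default_rng(0)
      X = rng.uniform(0, 1, size=(2000, 1))

      for label, f in [("sin(2*pi*x)", np.sin(2 * np.pi * X[:, 0])),
                       ("sin(80*pi*x)", np.sin(80 * np.pi * X[:, 0]))]:
          net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                             random_state=0).fit(X, f)
          print(label, "R^2:", round(net.score(X, f), 3))
      ```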
