Comment by quantadev

1 year ago

> What is a "linear weight"?

In the context of discussing linearity vs. non-linearity, adding the word "linear" in front of "weight" is clearer, which is what my top-level post in this thread was about too.

It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when the only things adjusted during training are linear "factors" (i.e., weights), yet that's enough to achieve genuine reasoning. During training we're not [normally] adjusting any parameters of the non-linear functions themselves. I include the caveat "normally" because I'm speaking of the basic perceptron NN with a squashing-type activation function.
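To make that concrete, here's a minimal sketch of a single perceptron layer (plain NumPy; the layer sizes and names are mine, purely for illustration). Every trainable parameter is a linear factor; the tanh "squashing" non-linearity is fixed and has nothing to train:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trainable parameters: purely linear factors.
W = rng.normal(size=(4, 3))   # linear weights -- the only thing training adjusts
b = np.zeros(3)               # bias -- also a linear parameter

def layer(x):
    z = x @ W + b             # linear combination: the trainable part
    return np.tanh(z)         # fixed "squashing" non-linearity: no parameters here

x = rng.normal(size=4)
print(layer(x))
```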

> It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when the only things adjusted during training are linear "factors" (i.e., weights), yet that's enough to achieve genuine reasoning.

When such basic perceptrons are scaled enormously, it becomes less surprising that they can achieve some level of 'genuine reasoning' (e.g., accurate next-word prediction), since at the end of the day the goal with such networks is just function approximation. What's more surprising to me is that we found ways to train such models at all; advances in hardware accelerators, combined with massive data, are factors just as significant in my opinion.
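As a toy illustration of that function-approximation point (my own sketch; the architecture, target function, and hyperparameters are arbitrary choices, not anyone's actual setup): gradient descent only ever touches the linear weights and biases, and the fixed tanh contributes nothing but its derivative, yet the network still fits a non-linear target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a non-linear function the network must approximate.
X = rng.uniform(-2, 2, size=(256, 1))
y = np.sin(X)

# Two layers of linear weights around one fixed tanh squashing.
W1, b1 = rng.normal(scale=0.5, size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass: linear -> tanh (no parameters) -> linear.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y

    # Backward pass: gradients land only on the linear weights/biases;
    # the tanh supplies its derivative (1 - h^2) but has nothing to update.
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)

    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

h = np.tanh(X @ W1 + b1)
print("final MSE:", float(((h @ W2 + b2 - y) ** 2).mean()))
```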

  • Yeah, no one is surprised that LLMs do what they're trained to do: predict tokens. The surprise comes from the fact that merely training to predict tokens ends up producing model weights that exhibit emergent reasoning.

    If you want to say reasoning and token prediction are just the same thing at scale, you can say that, but I don't fall into that camp. I think there's MUCH more to learn, and indeed a whole field of math or even physics that we haven't discovered yet; a step change in mathematical understanding analogous to the invention of calculus.