Comment by quantadev

1 year ago

Most LLMs aren't even using a "curve" at all yet, right? All they're using is a series of linear equations, because the model weights are a simple multiply and add (i.e. a basic NN Perceptron). Sure, there's a squashing function on the output to keep it in a range from 0 to 1, but that's done BECAUSE we're just adding up stuff.

I think future NNs will probably be more adaptive than this, with some Perceptrons using sine wave functions or other kinds of math functions beyond just the linear "y=mx+b".

It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".

The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so

Ax = y

Adding another layer is just multiplying a different set of weights by the output of the first, so

B(Ax) = y

If you remember your linear algebra course, you might see the problem: that can be simplified

(BA)x = y

Cx = y

Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.

To prevent this collapse, a nonlinear function must be introduced between each layer.
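
A quick numpy sketch of that collapse (the shapes and variable names are arbitrary, just for illustration):

  import numpy as np

  rng = np.random.default_rng(0)
  A = rng.standard_normal((4, 3))   # layer 1: 3 inputs -> 4 units
  B = rng.standard_normal((2, 4))   # layer 2: 4 units -> 2 outputs
  x = rng.standard_normal(3)

  stacked   = B @ (A @ x)           # B(Ax)
  collapsed = (B @ A) @ x           # (BA)x = Cx, a single layer
  print(np.allclose(stacked, collapsed))            # True: the stack is still linear

  relu = lambda v: np.maximum(v, 0.0)               # a nonlinearity between the layers
  print(np.allclose(B @ relu(A @ x), collapsed))    # generally False: no more collapse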

  • Right. All the squashing is doing is keeping the output of any neuron in a range below 1.

    But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.

    • The non-linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.

      > nothing but linearity

      No, if you have non-linearities, the NN itself is not linear. The non-linearities are not there primarily to keep the outputs in a given range, though that's important, too. (A quick XOR sketch is below.)

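      A minimal numpy sketch of the XOR point (the ReLU weights below are hand-picked, just one of many settings that work):

        import numpy as np

        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([0, 1, 1, 0], dtype=float)

        # Best purely linear fit (with a bias column): least squares.
        Xb = np.hstack([X, np.ones((4, 1))])
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        print(Xb @ w)                                # ~[0.5 0.5 0.5 0.5]: a plane can't do XOR

        # Two layers with a ReLU in between compute XOR exactly.
        relu = lambda v: np.maximum(v, 0.0)
        W1 = np.array([[1.0, 1.0], [1.0, 1.0]])      # hidden pre-activations: x1+x2, x1+x2-1
        b1 = np.array([0.0, -1.0])
        w2 = np.array([1.0, -2.0])                   # output = h1 - 2*h2
        print(relu(X @ W1 + b1) @ w2)                # [0. 1. 1. 0.]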

> It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".

In Ye Olden days (the ’90s) we used to approximate non-linear models using splines or separate-slopes models, fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.

Neural networks “just” do this automatically, and faster.
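
A rough numpy sketch of the spline idea, using np.interp as a stand-in for a hand-fit linear spline: the pieces are straight lines, and more knots means less error.

  import numpy as np

  x = np.linspace(0, 2 * np.pi, 10_000)
  target = np.sin(x)                      # the non-linear curve to approximate

  for n_knots in (5, 10, 50, 200):
      knots = np.linspace(0, 2 * np.pi, n_knots)
      approx = np.interp(x, knots, np.sin(knots))   # straight lines between the knots
      print(n_knots, float(np.max(np.abs(approx - target))))   # max error shrinks

A ReLU network is doing essentially the same piecewise-linear fit, except the knot locations and slopes are learned rather than chosen by hand.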

  • In college (BSME) I wrote a computer program to generate cam profiles from Bezier curves. It's just a programming trick to generate curves from straight lines, at any level of accuracy you want, simply by letting the computer take smaller and smaller steps.

    It's an interesting concept to think about how NNs might be able to exploit this effect in some way based on straight lines in the weights, because a very small number of points can identify very precise and smooth curves, where directions on the curve might equate to Semantic Space Vectors. (A small sketch of the trick is below.)

    • In fact, now that I think about it, for any 3 or more points in Semantic Space there would necessarily be a "Bezier Path" which would have genuine meaning at every point: a good, smooth, differentiable path through higher-dimensional space that gets from one point to another while "visiting" all the intermediate points. This has to have a direct use in LLMs in terms of reasoning.
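
      A small numpy sketch of that straight-lines-to-curves trick (de Casteljau's algorithm, which is nothing but repeated linear interpolation); the three "embedding" points here are made up, and the code works the same in 3 dimensions or 768:

        import numpy as np

        def de_casteljau(control_points, t):
            # Blend adjacent points by straight-line interpolation until one point remains.
            pts = np.asarray(control_points, dtype=float)
            while len(pts) > 1:
                pts = (1 - t) * pts[:-1] + t * pts[1:]
            return pts[0]

        P = [[0.0, 0.0, 1.0], [1.0, 2.0, 0.0], [2.0, 0.0, -1.0]]   # made-up control points
        for t in np.linspace(0.0, 1.0, 5):
            print(round(float(t), 2), de_casteljau(P, t))

      One caveat: a Bezier curve only interpolates its end points; the interior control points pull the curve toward them rather than being visited exactly.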