
Comment by OkayPhysicist

1 year ago

The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so

Ax = y

Adding another layer is just multiplying a different set of weights times the output of the first, so

B(Ax) = y

If you remember your linear algebra course, you might see the problem: that can be simplified

(BA)x = y

Cx = y

Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.

To prevent this collapse, a nonlinear function must be introduced between the layers.
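
To make that concrete, here is a minimal sketch (assuming NumPy; the shapes and weights are arbitrary) showing that two stacked linear layers are exactly one linear layer, while inserting a tanh between them breaks the collapse:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 outputs
    B = rng.standard_normal((2, 4))   # second layer: 4 inputs -> 2 outputs
    x = rng.standard_normal(3)        # input vector

    two_layers = B @ (A @ x)          # B(Ax)
    collapsed = (B @ A) @ x           # (BA)x = Cx
    print(np.allclose(two_layers, collapsed))   # True: the two layers are one linear map

    squashed = B @ np.tanh(A @ x)     # B(tanh(Ax)): nonlinearity between the layers
    print(np.allclose(squashed, collapsed))     # False: the tanh version is no longer the same linear map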

Right. All the squashing is doing is keeping the output of any neuron in a range below 1.

But the entire NN itself (a perceptron-style one, which most LLMs are) still uses nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y = m*x + b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.

  • The nonlinearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.

    > nothing but linearity

    No, if you have nonlinearities, the NN itself is not linear. The nonlinearities are not there primarily to keep the outputs in a given range, though that's important, too.
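
    A minimal sketch of that point (assuming NumPy; the weights here are arbitrary rather than trained): the weights enter linearly, but once a tanh sits between the layers, the map from input to output no longer satisfies f(a + b) = f(a) + f(b), so the network as a whole is not a linear function of its input.

        import numpy as np

        rng = np.random.default_rng(1)
        W1 = rng.standard_normal((4, 2))
        W2 = rng.standard_normal((1, 4))

        def f(x):
            # two-layer net with a tanh between the layers
            return W2 @ np.tanh(W1 @ x)

        a = np.array([1.0, -0.5])
        b = np.array([0.3, 2.0])
        print(f(a + b))       # not equal to ...
        print(f(a) + f(b))    # ... the sum of the parts, so f is not linear in x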

    • Nonlinearity somewhere is fundamental, but it doesn't need to be between each layer. You can, for instance, project each input to a higher-dimensional space with a nonlinearity, and the problem becomes linearly separable with high probability (cf. Cover's Theorem).

      So, for XOR, lift (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve (see the sketch below).

      Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).
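
      A quick sketch of that lift (assuming NumPy; a least-squares fit stands in for a trained linear layer): on the features (x, y, xy), XOR is reproduced exactly by the linear map x + y - 2xy.

          import numpy as np

          X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
          t = np.array([0, 1, 1, 0], dtype=float)             # XOR targets

          lifted = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])  # (x, y) -> (x, y, xy)
          w, *_ = np.linalg.lstsq(lifted, t, rcond=None)      # purely linear fit on the lifted inputs
          print(np.round(w, 6))           # [ 1.  1. -2.]  ->  x + y - 2xy
          print(np.round(lifted @ w, 6))  # [0. 1. 1. 0.]: XOR, from a linear model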

    • > The nonlinearities are not there primarily to keep the outputs in a given range

      Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?

      All the training does is adjust linear weights though, like I said. All it's doing is adjusting the slopes of lines.
