Comment by quantadev

1 year ago

All the trainable parameters are just slopes of lines tho. Training NNs doesn't involve adjusting any inputs to non-linear functions. The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1. There's no "magic" or "knowledge" in the tanh smashing. All the magic is 100% in the weights. They're all linear. The amazing thing is that all weights are linear slopes of lines.

Simply squashing the output of a linear layer would just mean multiplying by a small value. To avoid large y, you could add a step y' = y / 1000.

That would still be linear. And the result would be that, despite the squashing, no matter how many layers a model had, it could only fit linear problems, which can always be fit with a single layer, i.e. a single matrix.

So nobody does that.

The nonlinearity doesn't just squash some inputs; it creates a rich new feature: decision making. That's because on one side of a threshold, y gets converted very differently than on the other, e.g. if y > 0, y' = y, otherwise y' = 0.

Now you have a discontinuity in behavior: you have a decision.

Multiple layers making decisions can do far more than a linear layer: they can fit any continuous function (or any function with a finite number of discontinuities) on a bounded domain arbitrarily well.

Non-linearities add a fundamentally new feature. You can think of that feature as the ability to make decisions around the non-linear function's decision points.
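
To make that concrete, here's a minimal sketch (assuming numpy; the knot positions are hand-picked for illustration): each ReLU contributes one kink, i.e. one decision point, and a weighted sum of those kinks is a piecewise-linear curve that can track a smooth target like sin(x).

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    xs = np.linspace(0, 2 * np.pi, 200)
    knots = np.linspace(0, 2 * np.pi, 20, endpoint=False)  # the "decision points"
    features = relu(xs[:, None] - knots[None, :])           # one ReLU per knot

    # Least-squares fit of the ReLU features (plus a bias) to sin(x),
    # standing in for training.
    design = np.column_stack([features, np.ones_like(xs)])
    coef, *_ = np.linalg.lstsq(design, np.sin(xs), rcond=None)
    approx = design @ coef
    print("max abs error:", np.abs(approx - np.sin(xs)).max())  # on the order of 1e-2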

---

If you need to prove this to yourself with a simple example, try to create an XOR gate with this function:

    y = w1 * x1 + w2 * x2 + b

Where you can pick w1, w2 and b.

You are welcome to linearly squash the output, i.e. y' = y * w3, for whatever small w3 you like. It won't help.
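
Here's a minimal numpy sketch of that exercise (the grid search and the hand-picked ReLU weights are just for illustration): no linear setting reproduces XOR, while two layers with a ReLU between them solve it exactly.

    import itertools
    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = [0, 1, 1, 0]  # XOR truth table

    # Single linear unit y = w1*x1 + w2*x2 + b, read out as 1 when y > 0.5.
    # XOR is not linearly separable, so no weight setting can match it.
    grid = np.linspace(-2.0, 2.0, 21)
    hits = 0
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = [int(w1 * x1 + w2 * x2 + b > 0.5) for x1, x2 in inputs]
        hits += preds == targets
    print("linear settings that solve XOR:", hits)  # 0

    # Two layers with a ReLU in between (weights hand-picked):
    # XOR(x1, x2) = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
    def xor_net(x1, x2):
        return relu(x1 + x2) - 2.0 * relu(x1 + x2 - 1.0)

    print([int(xor_net(x1, x2)) for x1, x2 in inputs])  # [0, 1, 1, 0]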

Layers with non-linear transformations are layers of decision makers.

Layers of linear transforms are just an unnecessarily long way of writing a single linear transform, even with linear "squashing".

  • Right, it's obvious that the ReLU is just a gating mechanism, and you can think of that as a decision maker. It's like a "pass thru linearly proportionally" or "block" function.

    But I still find it counter-intuitive that it's not common practice in standard LLM NNs to have a trainable parameter that in some way directly "tunes" whatever Activation Function is being applied on EACH output.

    For example, I almost started experimenting with trigonometric activation functions in a custom NN where the phase angle would be adjusted, inspired by Fourier series. I can envision a type of NN where every model "weight" is actually a frequency component, because a Fourier series can represent any arbitrary function this way. There has of course already been similar research done by others along these lines.
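
    Something like this PyTorch sketch is the kind of thing I mean; the `SinActivation` module and its per-feature `phase` parameter are made up for illustration (the closest standard building block is `torch.nn.PReLU`, which learns a negative-side slope):

        import torch
        import torch.nn as nn

        class SinActivation(nn.Module):
            # Sine activation with a trainable per-feature phase shift.
            def __init__(self, num_features):
                super().__init__()
                # The phase is a learnable parameter, updated by backprop like any weight.
                self.phase = nn.Parameter(torch.zeros(num_features))

            def forward(self, x):
                return torch.sin(x + self.phase)

        model = nn.Sequential(nn.Linear(4, 8), SinActivation(8), nn.Linear(8, 1))
        out = model(torch.randn(2, 4))
        print(out.shape)  # torch.Size([2, 1])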

> The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1.

That's not the main point, even though it probably helps. As OkayPhysicist said above, without a nonlinearity you could collapse all the weight matrices into a single matrix. If you have 2 layers (same size, for simplicity) described by weight matrices A and B, you could multiply them and get C = BA, which you could use for inference.

Now, you can do this same trick not only with 2 layers but with 100 million, all collapsing into a single matrix after multiplication. If the nonlinearities weren't there, the effective information content of the whole NN would collapse into that of a single-layer NN.
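
A minimal numpy sketch of the collapse (sizes picked arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(8, 8))  # first layer's weight matrix
    B = rng.normal(size=(8, 8))  # second layer's weight matrix
    C = B @ A                    # the two layers collapsed into one matrix

    x = rng.normal(size=8)
    print(np.allclose(B @ (A @ x), C @ x))  # True: two linear layers == one layer

    # Put a nonlinearity between the layers and the collapse no longer works:
    # B @ np.tanh(A @ x) is not C @ x for any single fixed C.
    print(np.allclose(B @ np.tanh(A @ x), C @ x))  # False (in general)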

  • You can explain the "effect" of tanh at any level of abstraction you like, up to and including describing things that happen in Semantic Space itself, but my description of what tanh is doing is 100% accurate in the context I used it. All it's doing is squashing a number down to below one. My understanding of how the Perceptron works is fully correct, and isn't missing any details. I've implemented many of them.

    • Your description of tanh isn't even correct: it squashes a real number to `(-1, 1)`, not "less than one".

      You're curious about whether there is any gain in parameterising activation functions and learning them too, or rather, why that's not done much in practice. That's an interesting academic question, and it seems like you're already experimenting with your own kinds of activation functions. However, people in this thread (including myself) wanted to clarify some perceived misunderstandings you had about nonlinearities and "why" they are used in DNNs, and how "squashing function" is a misnomer because `g(x) = x/1000` doesn't introduce any nonlinearity. Yet you continue to fixate and double down on your knowledge of "what" a tanh is, and even that is incorrect.
