Comment by Nevermark

1 year ago

Simply squashing the output of a linear layer would just be multiplying by a small value: to avoid large y, you add a step y' = y/1000.

That would still be linear. And the result would be that, despite the squashing, no matter how many layers a model had, it could only fit linear problems, which can always be fit with a single layer, i.e. a single matrix.

So nobody does that.
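A quick numpy sketch of that collapse (the weights here are random, purely for illustration): two stacked linear layers with a 1/1000 "squash" in between reduce to a single equivalent affine map.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)

    # Layer 1: y = W1 @ x + b1, followed by the linear "squash" y' = y / 1000
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
    # Layer 2: z = W2 @ y' + b2
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

    z_stacked = W2 @ ((W1 @ x + b1) / 1000) + b2

    # The same map written as one affine transform z = W @ x + b
    W = (W2 @ W1) / 1000
    b = (W2 @ b1) / 1000 + b2
    z_single = W @ x + b

    print(np.allclose(z_stacked, z_single))  # True: the extra layer bought nothing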

The nonlinearity doesn't just squash some inputs. It creates a rich new feature: decision making. That's because on one side of a threshold y gets converted very differently than on the other. I.e. if y > 0, y' = y, otherwise y' = 0.

Now you have a discontinuity in behavior (the slope changes abruptly at the threshold): you have a decision.

Multiple layers making decisions can do far more than a linear layer. They can fit any continuous function (or any function with a finite number of discontinuities) arbitrarily well over a bounded domain.

Non-linearities add a fundamentally new feature. You can think of that feature as the ability to make decisions around the non-linear function's decision points.
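A tiny numpy sketch of that decision rule (the values are arbitrary): the same ReLU treats the two sides of its threshold completely differently, which is exactly what a linear map cannot do.

    import numpy as np

    def relu(y):
        # if y > 0: y' = y, otherwise y' = 0
        return np.maximum(0.0, y)

    y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(y))                 # [0.  0.  0.  0.5 2. ]

    # Non-linearity in action: the output of a sum is not the sum of outputs,
    # because -1.0 falls on the "block" side of the decision point.
    print(relu(-1.0 + 2.0))        # 1.0
    print(relu(-1.0) + relu(2.0))  # 2.0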

---

If you need to prove this to yourself with a simple example, try to create an XOR gate with this function:

    y = w1 * x1 + w2 * x2 + b

Where you can pick w1, w2 and b.

You are welcome to linearly squash the output, i.e. y' = y * w3, for whatever small w3 you like. It won't help.
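Here is a sketch of that exercise in numpy (the brute-force grid and the hand-picked ReLU weights are mine, purely for illustration): no linear choice of w1, w2, b reproduces XOR, even when the output is thresholded at 0.5, while a two-layer ReLU network hits it exactly.

    import itertools
    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    xor = np.array([0, 1, 1, 0], dtype=float)

    # Brute-force grid search: even thresholding y at 0.5 never matches XOR
    # on all four inputs (XOR is not linearly separable).
    grid = np.linspace(-2, 2, 41)
    found = any(
        np.array_equal(((X @ [w1, w2] + b) > 0.5).astype(float), xor)
        for w1, w2, b in itertools.product(grid, grid, grid)
    )
    print("linear fit found:", found)  # False

    # Two layers with a ReLU in between, weights chosen by hand:
    # h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2 * h2
    relu = lambda z: np.maximum(0.0, z)
    h = relu(X @ np.array([[1.0, 1.0], [1.0, 1.0]]) + np.array([0.0, -1.0]))
    y = h @ np.array([1.0, -2.0])
    print("relu network:", y)  # [0. 1. 1. 0.]

The grid search is only an illustration, not a proof; the impossibility follows from XOR not being linearly separable.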

Layers with non-linear transformations are layers of decision makers.

Layers of linear transforms are just unnecessarily long ways of writing a single linear transform. Even with linear "squashing".

---

Right, it's obvious that ReLU is just a gating mechanism, and you can think of that as a decision maker. It's like a "pass through proportionally" or "block" function.

But I still find it counter-intuitive that it's not common practice in standard LLM NNs to have a trainable parameter that in some way directly "tunes" whatever activation function is applied to EACH output.
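That said, trainable activations do exist, e.g. PReLU, which learns the slope of the "block" side per output unit. Here's a minimal PyTorch-style sketch of the idea (the layer sizes are made up for illustration; torch.nn.PReLU provides a built-in version):

    import torch
    import torch.nn as nn

    class TunableGate(nn.Module):
        """ReLU-like gate with one learnable negative-side slope per output."""
        def __init__(self, num_units: int):
            super().__init__()
            # One trainable parameter per output unit, learned by backprop.
            self.slope = nn.Parameter(torch.full((num_units,), 0.25))

        def forward(self, y: torch.Tensor) -> torch.Tensor:
            # Positive side passes through; negative side is scaled by the
            # learned slope instead of being hard-blocked.
            return torch.where(y > 0, y, self.slope * y)

    # Hypothetical two-layer block; 16 and 8 are arbitrary sizes.
    model = nn.Sequential(nn.Linear(16, 8), TunableGate(8), nn.Linear(8, 1))
    print(model(torch.randn(4, 16)).shape)  # torch.Size([4, 1])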

For example, I almost started experimenting with trigonometric activation functions in a custom NN where the phase angle would be adjusted, inspired by Fourier series. I can envision a type of NN where every model "weight" is actually a frequency component, since a Fourier series can represent essentially any function on an interval this way. There has of course already been similar research done by others along these lines.
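As a toy sketch of that direction (not the existing research alluded to, just an illustration): a sine activation whose frequency and phase are trainable per unit, in the spirit of learned Fourier-like features.

    import torch
    import torch.nn as nn

    class SinActivation(nn.Module):
        """Trigonometric activation sin(freq * y + phase), trainable per unit."""
        def __init__(self, num_units: int):
            super().__init__()
            # Each output unit gets its own frequency and phase angle,
            # both adjusted by backprop like any other weight.
            self.freq = nn.Parameter(torch.ones(num_units))
            self.phase = nn.Parameter(torch.zeros(num_units))

        def forward(self, y: torch.Tensor) -> torch.Tensor:
            return torch.sin(self.freq * y + self.phase)

    # Arbitrary sizes, purely illustrative.
    net = nn.Sequential(nn.Linear(1, 32), SinActivation(32), nn.Linear(32, 1))
    print(net(torch.linspace(-1, 1, 5).unsqueeze(1)).shape)  # torch.Size([5, 1])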