Comment by wrs

1 year ago

With a ReLU activation function, you get a piecewise linear approximation of a nonlinear function rather than just a simple linear function of the inputs.

ReLU enables this by being nonlinear in a simple way: it outputs zero for negative inputs, so each linear unit can confine its contribution to one portion of the output curve.

(This is a lot easier to see on a whiteboard!)
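If a whiteboard isn't handy, here's roughly the same picture in a few lines of NumPy (the kink positions and slope changes below are just values I picked by hand for a sine curve): each unit switches on at its kink, adds a new slope from there onward, and the sum traces the curve piece by piece.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(0.0, 1.0, 401)
target = np.sin(2 * np.pi * x)        # a smooth nonlinear curve to approximate

# Hand-picked kink locations and slope changes (illustrative values only):
# the first entry sets the initial slope, the later ones say how much the
# slope changes at each kink. Before its kink, a unit contributes nothing.
kinks = np.array([0.0, 0.25, 0.75])
slope_changes = np.array([4.0, -8.0, 8.0])

approx = sum(w * relu(x - k) for w, k in zip(slope_changes, kinks))
print(f"max |error| = {np.max(np.abs(approx - target)):.2f}")  # ~0.21 with three segments
```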

ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it demonstrates even better than tanh-style squashing that all this LLM stuff is ultimately being done with straight-line math. All a ReLU function does is choose which line to use: a sloped one or a flat zero one.
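For what it's worth, the "two lines" reading is easy to check numerically (a throwaway sketch, nothing more): away from zero, ReLU's slope is exactly 0 or exactly 1, while tanh's slope takes a different value at every point.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)              # [-3, -2, -1, 0, 1, 2, 3]
relu = np.maximum(0.0, x)

# Finite-difference slopes: ReLU yields only the two values 0 and 1 ...
print(np.diff(relu) / np.diff(x))          # [0. 0. 0. 1. 1. 1.]
# ... while tanh's slope, 1 - tanh(x)^2, varies continuously with x.
print(np.round(1.0 - np.tanh(x) ** 2, 3))
```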

  • Well. The word “linear” as you’re using it doesn’t seem to have any particular meaning, certainly not the standard mathematical one, so I’m not sure we can make further progress on this explanation.

    I’ll just reiterate that the single “technical” (whatever that means) nonlinearity in ReLU is exactly what lets a layer approximate any continuous[*] function.

    [*] I may have left out some adjectives here that are needed for full precision.
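    As a quick sanity check of that claim (everything here is made up on the spot: random hidden weights plus a least-squares readout), a single hidden ReLU layer already fits an arbitrary smooth curve quite closely.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 400)
    target = np.exp(-4.0 * x**2) * np.cos(6.0 * x)   # an arbitrary smooth target

    # One hidden ReLU layer with random weights and biases; only the output
    # weights are fitted, via least squares here just to keep the sketch short.
    w = rng.normal(size=200)
    b = rng.uniform(-1.0, 1.0, size=200)
    features = np.maximum(0.0, np.outer(x, w) + b)   # shape (400, 200)

    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    print(f"max |error| = {np.max(np.abs(features @ coef - target)):.4f}")
    ```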

      • If you're confused, just show a tanh graph and a ReLU graph to a 7-year-old and ask which one is linear. They'll all get it right. So you're not confused in the slightest about anything I've said. There's nothing even slightly confusing about saying a ReLU is made of two lines.
