Comment by mr_toad
1 year ago
You need a non-linear activation function for the universal approximation theorem to hold. Otherwise, as others have said, the model just collapses to a single linear layer.
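For example, here is a minimal NumPy sketch (layer sizes and data are arbitrary) showing that stacked linear layers with nothing between them compose into one linear map; the same holds with biases, since a composition of affine maps is affine:

    import numpy as np

    rng = np.random.default_rng(0)

    # Three "layers" with no activation between them: y = W3 @ (W2 @ (W1 @ x))
    W1 = rng.standard_normal((8, 4))
    W2 = rng.standard_normal((8, 8))
    W3 = rng.standard_normal((2, 8))

    x = rng.standard_normal(4)

    deep = W3 @ (W2 @ (W1 @ x))       # the "three-layer" network
    collapsed = (W3 @ W2 @ W1) @ x    # the single equivalent layer

    print(np.allclose(deep, collapsed))  # True: depth buys nothing without non-linearity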
Technically the output is still what a statistician would call “linear in the parameters” (it is a linear combination of the hidden units, so it is linear in the final layer's weights), but due to the universal approximation theorem the network as a whole can approximate any continuous non-linear function.
https://stats.stackexchange.com/questions/275358/why-is-incr...
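To make the “linear in the parameters” point concrete, here is a random-features sketch (the tanh units, hidden width, and sin(2x) target are arbitrary choices, not anything from the thread): the hidden weights are frozen at random values and only the output weights are fit, so the fit is an ordinary linear least-squares problem in those parameters, yet the model matches a non-linear function:

    import numpy as np

    rng = np.random.default_rng(0)

    # Target: a non-linear function of x.
    x = np.linspace(-3, 3, 200)
    y = np.sin(2 * x)

    # One hidden layer with *fixed* random weights; only the output weights c are fit.
    # The model sum_i c_i * tanh(w_i * x + b_i) is linear in c, non-linear in x.
    n_hidden = 50
    w = rng.standard_normal(n_hidden)
    b = rng.uniform(-3, 3, n_hidden)
    H = np.tanh(np.outer(x, w) + b)            # (200, 50) design matrix of hidden units

    c, *_ = np.linalg.lstsq(H, y, rcond=None)  # ordinary least squares in c
    print(np.max(np.abs(H @ c - y)))           # residual should be small: a good fit to sin(2x)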
As you can see in what I just posted about an inch below this, my point is that the process of training a NN does not involve adjusting any parameters of the non-linear functions themselves. What goes into an activation function is a weighted sum plus a bias, i.e. an affine combination of the previous layer's outputs, but there's no "tunable" parameter (one adjusted during training) that's fed into the activation function itself.
Learnable parameters on activations do exist; look up parametric activation functions such as PReLU.
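For instance, here is a minimal PyTorch sketch (layer sizes arbitrary) using nn.PReLU, whose negative-side slope is a parameter that receives gradients and is updated during training like any weight:

    import torch
    import torch.nn as nn

    # PReLU: f(x) = x for x >= 0, a * x for x < 0, with the slope `a` learnable.
    model = nn.Sequential(nn.Linear(4, 8), nn.PReLU(), nn.Linear(8, 1))

    x = torch.randn(16, 4)
    loss = model(x).pow(2).mean()
    loss.backward()

    prelu = model[1]
    print(prelu.weight)       # the activation's own trainable parameter `a`
    print(prelu.weight.grad)  # gradient flows to it, so it is adjusted during training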
Of course they exist. A parameterized activation function is one of the most obvious things to try in NN design, and has certainly been invented and studied by thousands of researchers.