Comment by quantadev

1 year ago

Right, it's obvious that ReLU is just a gating mechanism, and you can think of that as a decision maker. It's essentially a "pass through proportionally" or "block" function.

But I still find it counter-intuitive that it's not common practice in standard LLM NNs to have a trainable parameter that directly "tunes" whatever activation function is applied to EACH output.
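
A minimal sketch of the kind of thing I mean, assuming a PReLU-style learnable negative slope with one parameter per output unit (module name, initialization, and PyTorch usage are just illustrative, not how any particular LLM does it):

```python
import torch
import torch.nn as nn

class PerUnitTunableReLU(nn.Module):
    """ReLU-like gate with one trainable "tuning" parameter per output unit.

    Each unit learns its own negative-side slope (PReLU-style), so training
    can adjust how hard each gate blocks versus passes its input.
    """
    def __init__(self, num_units: int):
        super().__init__()
        # One learnable parameter per output unit, initialized to a small slope.
        self.slope = nn.Parameter(torch.full((num_units,), 0.25))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Positive inputs pass through linearly; negative inputs are scaled
        # by the learned per-unit slope instead of being hard-blocked at zero.
        return torch.where(x >= 0, x, self.slope * x)


# Usage: one parameter per output feature of the preceding linear layer.
layer = nn.Sequential(nn.Linear(16, 32), PerUnitTunableReLU(32))
out = layer(torch.randn(4, 16))  # shape (4, 32)
```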

For example, I almost started experimenting with trigonometric activation functions in a custom NN where the phase angle would be adjusted during training, inspired by Fourier series. I can envision a type of NN where every model "weight" is actually a frequency component, since a Fourier series can represent any arbitrary function that way. Of course, similar research along these lines has already been done by others.
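
A rough sketch of that Fourier-inspired direction, purely as an assumption about how it might look (learnable per-unit frequency and phase; names are hypothetical):

```python
import torch
import torch.nn as nn

class LearnablePhaseSine(nn.Module):
    """Sinusoidal activation with a trainable frequency and phase per unit.

    Each unit computes sin(frequency * x + phase), with both terms learned
    during training, loosely in the spirit of a Fourier-series expansion.
    """
    def __init__(self, num_units: int):
        super().__init__()
        self.frequency = nn.Parameter(torch.ones(num_units))
        self.phase = nn.Parameter(torch.zeros(num_units))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.frequency * x + self.phase)


# Usage: drop-in replacement for any elementwise activation.
layer = nn.Sequential(nn.Linear(16, 32), LearnablePhaseSine(32))
out = layer(torch.randn(4, 16))  # shape (4, 32)
```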