Comment by scarmig

1 year ago

Nonlinearity somewhere is fundamental, but it doesn't need to be between each layer. You can, for instance, project each input to a higher dimensional space with a nonlinearity, and the problem becomes linearly separable with high probability (cf Cover's Theorem).

So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.

Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).