Comment by andy12_

9 months ago

It's actually implied in the paper that the neural memory module M can be anything, and there's probably a lot of room to test different kinds of architectures for M. But in this paper M is an MLP of 1 layer (fig. 7 is an ablation study using different number of layers for the MLP).

> using a matrix-valued memory M [...] is an online linear regression objective and so the optimal solution assumes the underlying dependency of historical data is linear. On the other hand, we argue that deep memory modules (i.e., M ≥ 2) . Aligning with the theoretical results that MLPs with at least two layers are strictly more expressive than linear models (Hornik, Stinchcombe, and White 1989), in Section 5.5, we show that deep memory modules are more effective in practice