
Comment by timlarshanson

1 day ago

Ok, thanks for the clarification.

Seems the implicit assumption then is that the memory read M(q) -> v 'looks like' or 'is smooth like' the dot product; otherwise 'train on keys, inference on queries' wouldn't work? (Safe assumption imo with that l2 norm & in general; unsafe if q and k come from different distributions.)
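To make that concrete, here's a minimal sketch of the 'train on keys, inference on queries' point. The setup is my own toy, not the paper's: values are a fixed linear function of the keys and the memory is fit by least squares, so retrieval works when queries sit near the stored keys and degrades as they drift away.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 1024

# Toy setup (mine, not the paper's): values are a fixed linear function of the keys,
# so an exact matrix-valued memory with M(k_i) = v_i exists.
W_true = rng.normal(size=(d, d))
K_train = rng.normal(size=(n, d))
V_train = K_train @ W_true

# "Train on keys": least-squares fit of the memory M.
M, *_ = np.linalg.lstsq(K_train, V_train, rcond=None)

# "Inference on queries": q_i = k_i + noise. M(q_i) ~= v_i only because M is
# smooth (here, linear) around the stored keys; the error grows as the queries
# drift away from the key distribution.
for noise in (0.01, 0.1, 1.0):
    Q = K_train + noise * rng.normal(size=K_train.shape)
    rel_err = np.linalg.norm(Q @ M - V_train) / np.linalg.norm(V_train)
    print(f"query noise {noise:4.2f} -> relative retrieval error {rel_err:.3f}")
```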

Correct me if I'm wrong, but typically k and v are generated via affine projections K, V of the tokens; if M is matrix-valued and there are no forget and remember gates (to somehow approx the softmax?), then M = V K^-1 (assuming K is square and invertible).
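A quick numeric check of that identity, under the same assumptions (purely linear projections, square invertible K): if k = K x and v = V x for every token x, then M = V K^-1 maps each key to its value.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Linear projections of the tokens; K assumed square and invertible.
K = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
X = rng.normal(size=(d, 100))      # 100 token embeddings as columns

k, v = K @ X, V @ X                # keys and values
M = V @ np.linalg.inv(K)           # the claimed matrix-valued memory

# M k = V K^-1 K x = V x = v for every token
print(np.allclose(M @ k, v))       # True
```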

It's actually implied in the paper that the neural memory module M can be anything, and there's probably a lot of room to test different kinds of architectures for M. But in this paper M is a 1-layer MLP (fig. 7 is an ablation study using different numbers of layers for the MLP).
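For concreteness, here's a rough sketch of how such an MLP memory could be written to with online gradient steps and read with queries. This is my simplification, not the paper's exact rule: it drops the momentum and forget-gate terms, and all shapes and projections below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden = 64, 128

# A deep memory module: a 2-layer MLP instead of a single matrix.
memory = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.SiLU(),
    nn.Linear(d_hidden, d_model),
)
opt = torch.optim.SGD(memory.parameters(), lr=1e-2)

def write(k: torch.Tensor, v: torch.Tensor) -> None:
    """Online write: one gradient step on the associative loss ||M(k) - v||^2."""
    opt.zero_grad()
    loss = ((memory(k) - v) ** 2).mean()
    loss.backward()
    opt.step()

def read(q: torch.Tensor) -> torch.Tensor:
    """Read: just a forward pass with the query ('train on keys, inference on queries')."""
    with torch.no_grad():
        return memory(q)

# Toy usage with hypothetical projections of a token stream.
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5
W_q = torch.randn(d_model, d_model) / d_model ** 0.5

for _ in range(100):                 # stream of tokens
    x = torch.randn(1, d_model)
    write(x @ W_k, x @ W_v)          # memorize the (key, value) pair

x = torch.randn(1, d_model)
print(read(x @ W_q).shape)           # torch.Size([1, 64])
```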

> using a matrix-valued memory M [...] is an online linear regression objective and so the optimal solution assumes the underlying dependency of historical data is linear. On the other hand, we argue that deep memory modules (i.e., L_M ≥ 2) [...]. Aligning with the theoretical results that MLPs with at least two layers are strictly more expressive than linear models (Hornik, Stinchcombe, and White 1989), in Section 5.5, we show that deep memory modules are more effective in practice.