Comment by sametoymak
3 years ago
This seems related to the NTK literature, i.e., wide neural nets behave like kernel regression. NTK is a great tool, but a notable weakness is that the kernel view doesn't explain how the model learns new features. The transformer is also quite different from standard neural architectures because tokens interact with each other through attention. Our goal was to capture this interaction, and we believe there is a clean insight on feature learning: attention runs a token-selection procedure by implementing an SVM that separates tokens.
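As a rough intuition for that token-selection view (my own toy sketch, not code from the paper): scaled dot-product attention weights concentrate on the highest-scoring token as the score scale grows, so in the limit attention behaves like a hard selection rule on tokens. All names and numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))   # 5 tokens, embedding dim 4
query = rng.normal(size=4)         # a single query vector

def attention_weights(scale):
    """Softmax attention weights over tokens for a scaled dot-product score."""
    scores = scale * tokens @ query
    scores -= scores.max()          # numerical stability
    w = np.exp(scores)
    return w / w.sum()

# As `scale` grows, the weight vector approaches a one-hot "selection"
# of the token maximizing tokens @ query.
for scale in (1.0, 5.0, 100.0):
    print(scale, np.round(attention_weights(scale), 3))
```

With a small scale the weights are spread out; with a large scale nearly all mass sits on the single best-scoring token, which is the sense in which attention "selects" tokens here.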
See our re-examination of the kernel equivalence. Path kernels measure exactly how the model learns as its understanding of the data improves during training, and this can be expressed in terms of the gradients with respect to each training input: https://arxiv.org/abs/2308.00824
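To make the path-kernel idea concrete, here is a hedged toy sketch (my own construction, not the paper's code): accumulate the tangent kernel, i.e. inner products of parameter gradients of the model output, along a gradient-descent trajectory for a tiny one-hidden-layer net on a made-up regression task.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))           # toy training inputs
y = np.sin(X.sum(axis=1))             # toy targets
W = 0.5 * rng.normal(size=(6, 3))     # hidden-layer weights
v = 0.5 * rng.normal(size=6)          # output weights
lr, steps = 0.05, 50

def forward(x):
    return v @ np.tanh(W @ x)

def param_grad(x):
    """Gradient of f(x) with respect to all parameters, flattened."""
    h = np.tanh(W @ x)
    gv = h                                 # df/dv
    gW = np.outer(v * (1 - h**2), x)       # df/dW
    return np.concatenate([gv, gW.ravel()])

# Path kernel: sum of step-size-weighted tangent kernels along training.
path_kernel = np.zeros((len(X), len(X)))
for _ in range(steps):
    J = np.stack([param_grad(x) for x in X])  # Jacobian at current params
    path_kernel += lr * (J @ J.T)             # accumulate tangent kernel
    # one gradient-descent step on mean squared error
    residual = np.array([forward(x) for x in X]) - y
    gv_total = np.zeros_like(v)
    gW_total = np.zeros_like(W)
    for r, x in zip(residual, X):
        h = np.tanh(W @ x)
        gv_total += r * h
        gW_total += r * np.outer(v * (1 - h**2), x)
    v -= lr * gv_total / len(X)
    W -= lr * gW_total / len(X)

print(np.round(path_kernel[:3, :3], 3))
```

Since each per-step term `J @ J.T` is positive semidefinite, the accumulated path kernel is as well; it records how every training input's gradient contributed at every point along the trajectory, which is the "layering of understanding" the comment alludes to.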
We believe that all neural networks are effectively SVMs, or more generally reproducing-kernel architectures, that implicitly layer the understanding contributed by each training iteration. Do you have any comments in the RKHS or RKBS context for transformers?