
Comment by visarga

3 years ago

Any comment on how the paper relates to "Every Model Learned by Gradient Descent Is Approximately a Kernel Machine" by Pedro Domingos?

https://arxiv.org/pdf/2308.16898.pdf

This seems related to the NTK literature, i.e., wide neural nets behave like kernel regression. NTK is a great tool, but a notable weakness is that the kernel view doesn't explain how the model learns new features. The transformer is also quite different from standard neural architectures because tokens interact with each other through attention. Our goal was to capture this interaction, and we believe there is a clean insight on feature learning: attention runs a token-selection procedure by implementing an SVM that separates tokens.
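The token-selection reading above can be illustrated with a minimal numerical sketch. This is our own toy example, not code from the paper: as the scale of the query-key scores grows (equivalently, as the softmax temperature shrinks), attention weights concentrate on the highest-scoring token, the hard "selection" that the SVM equivalence formalises via margin maximisation. All names here (`attn_weights`, the dimensions, the temperatures) are illustrative assumptions.

```python
import numpy as np

# Toy sketch: softmax attention as a token-selection procedure.
# As the inverse temperature grows, the softmax over tokens
# approaches a hard argmax over query-key scores.

rng = np.random.default_rng(0)
d, n = 4, 5                      # embedding dim, number of key tokens
q = rng.normal(size=d)           # a single query token
K = rng.normal(size=(n, d))      # key tokens

scores = K @ q                   # attention logits, one per token

def attn_weights(scores, temp):
    z = scores / temp
    z = z - z.max()              # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

for temp in [1.0, 0.1, 0.01]:
    w = attn_weights(scores, temp)
    print(f"temp={temp}: weights={np.round(w, 3)}")

# As temp -> 0 the weight mass concentrates on argmax(scores):
# the attention layer "selects" one token.
```

At temperature 1 the weights are spread across tokens; by temperature 0.01 essentially all mass sits on the best-scoring token, which is the regime where the separating-hyperplane picture applies.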

  • See our re-examination of the kernel equivalence. Path kernels exactly measure how models learn as their understanding of the data improves during training, and this can be expressed in terms of the gradients with respect to each training input: https://arxiv.org/abs/2308.00824

    We believe that all neural networks are effectively an SVM, or more generally a reproducing-kernel architecture, that implicitly layers the understanding contributed during each training iteration. Do you have any comment on transformers in the RKHS or RKBS context?
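The path-kernel idea in the comment above can be sketched numerically. This is our own hedged reading, not code from either paper: the path kernel between a test input and a training input accumulates, over the gradient-descent trajectory, the inner product of the model's parameter gradients at the two inputs. The tiny two-layer net, the finite-step approximation of the integral, and all names are illustrative assumptions.

```python
import numpy as np

# Toy sketch of a path kernel: K_path(x, x') ~ sum over GD steps of
# lr * <grad_w f(x), grad_w f(x')>, accumulated while a small
# one-hidden-layer net is trained by gradient descent.

rng = np.random.default_rng(1)

def forward(params, x):
    W1, W2 = params
    return np.tanh(x @ W1) @ W2          # scalar output per example

def grad_params(params, x):
    # gradient of the scalar output w.r.t. all parameters, flattened
    W1, W2 = params
    h = np.tanh(x @ W1)
    dW2 = h
    dW1 = np.outer(x, W2 * (1 - h**2))
    return np.concatenate([dW1.ravel(), dW2.ravel()])

# toy regression data
X = rng.normal(size=(8, 3))
y = np.sin(X[:, 0])
x_test = rng.normal(size=3)

W1 = rng.normal(size=(3, 4)) * 0.5
W2 = rng.normal(size=4) * 0.5
lr, steps = 0.05, 50
path_kernel = np.zeros(len(X))           # K_path(x_test, x_i)

for _ in range(steps):
    g_test = grad_params((W1, W2), x_test)
    # accumulate gradient inner products along the trajectory
    for i, x in enumerate(X):
        path_kernel[i] += lr * g_test @ grad_params((W1, W2), x)
    # one squared-loss gradient-descent step
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(X, y):
        h = np.tanh(x @ W1)
        err = forward((W1, W2), x) - t
        dW2 += err * h
        dW1 += err * np.outer(x, W2 * (1 - h**2))
    W1 -= lr * dW1 / len(X)
    W2 -= lr * dW2 / len(X)

print("path kernel to each training point:", np.round(path_kernel, 3))
```

The resulting vector assigns each training point a similarity to the test point that depends on the whole training path, not just the final weights, which is the sense in which the kernel "layers" the contribution of each training iteration.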