
Comment by MontyCarloHall

2 days ago

It's utterly baffling to me that there hasn't been more SOTA machine learning research on Gaussian processes with the kernels inferred via deep learning. It seems a lot more flexible than the primitive, rigid dot product attention that has come to dominate every aspect of modern AI.
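
A rough sketch of what "kernels inferred via deep learning" can look like, in the spirit of deep kernel learning: an ordinary GP whose kernel is applied to features from a neural network. Everything below (sizes, the untrained random weights) is made up for illustration, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer feature extractor. In a real deep-kernel GP these weights would be
# learned jointly with the kernel hyperparameters (e.g. by maximizing the GP
# marginal likelihood); here they are random, purely for illustration.
W1, b1 = rng.normal(size=(5, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)

def features(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

def deep_rbf_kernel(X1, X2, lengthscale=1.0):
    # RBF kernel evaluated on learned features instead of raw inputs.
    Z1, Z2 = features(X1), features(X2)
    sq_dists = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

# Standard GP regression on top of the learned kernel.
X_train, y_train = rng.normal(size=(50, 5)), rng.normal(size=50)
X_test = rng.normal(size=(10, 5))
K = deep_rbf_kernel(X_train, X_train) + 1e-2 * np.eye(50)   # + noise jitter
K_star = deep_rbf_kernel(X_test, X_train)
posterior_mean = K_star @ np.linalg.solve(K, y_train)        # shape (10,)
```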

I think this mostly comes down to (multi-headed) scaled dot-product attention just being very easy to parallelize on GPUs. You can then make up for the (relative) lack of expressivity / flexibility by just stacking layers.
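
To make the parallelism point concrete, here is a minimal NumPy sketch of (multi-head) scaled dot-product attention with invented shapes: the whole operation is two batched matrix multiplications plus a softmax, which is exactly the workload GPUs are optimized for.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (batch, heads, seq_len, d_k)."""
    d_k = Q.shape[-1]
    # One batched matmul gives every query-key score at once.
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    # Softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # A second batched matmul mixes the values.
    return weights @ V

# Example: batch of 2 sequences, 4 heads, 8 tokens, 16-dim keys/values.
Q = K = V = np.random.default_rng(0).normal(size=(2, 4, 8, 16))
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 4, 8, 16)
```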

  • A neural-GP could probably be trained with the same parallelization efficiency via consistent discretization of the input space (one concrete version of that idea is sketched after this sub-thread). I think their absence owes more to the fact that discrete data (namely, text) has dominated AI applications. I imagine that neural-GPs could be extremely useful for scale-free interpolation of continuous data (e.g. images), or other non-autoregressive generative models (scale-free diffusion?)

    • Right, I think there are plenty of other approaches that surely scale just as easily or better. It's like you said, the (early) dominance of text data just artificially narrowed the approaches tried.
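
Not necessarily what was meant by "consistent discretization", but one concrete version of the idea: evaluate a stationary kernel on a regular grid and the Gram matrix becomes Toeplitz, so the matrix-vector products at the heart of iterative GP solvers drop from O(n^2) to O(n log n) via the FFT. A minimal sketch, all sizes invented:

```python
import numpy as np

def rbf_first_column(grid, lengthscale=0.5):
    # On a regular grid a stationary kernel's Gram matrix is Toeplitz, so it is
    # fully described by its first column.
    return np.exp(-0.5 * (grid - grid[0]) ** 2 / lengthscale**2)

def toeplitz_matvec(c, v):
    # Multiply a symmetric Toeplitz matrix (first column c) by v in O(n log n)
    # by embedding it in a circulant matrix and convolving with the FFT.
    n = len(c)
    circ = np.concatenate([c, c[-2:0:-1]])           # circulant embedding, length 2n - 2
    v_pad = np.concatenate([v, np.zeros(n - 2)])
    return np.fft.ifft(np.fft.fft(circ) * np.fft.fft(v_pad)).real[:n]

grid = np.linspace(0.0, 1.0, 4096)                    # the regular discretization
c = rbf_first_column(grid)
v = np.random.default_rng(0).normal(size=grid.size)
fast = toeplitz_matvec(c, v)                          # O(n log n) instead of O(n^2)

# Sanity check against the dense product on a small prefix of the grid.
K_small = np.exp(-0.5 * (grid[:512, None] - grid[None, :512]) ** 2 / 0.5**2)
assert np.allclose(K_small @ v[:512], toeplitz_matvec(c[:512], v[:512]))
```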

Doesn't involve Gaussians, but:

The Free Transformer: https://arxiv.org/abs/2510.17558

Abstract: We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.

In addition to what others have said, computational complexity is a big reason. Gaussian processes and kernelized SVMs have fit complexities of O(n^2) to O(n^3) (where n is the number of samples, assuming exact solutions rather than approximations), while neural nets and tree ensembles are O(n).

Datasets with lots of samples also tend to be very common (e.g. the huge text corpora LLMs train on). In my experience, most datasets for projects tend to be on the larger side (10k+ samples).
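
To put rough numbers on the cubic scaling (sizes arbitrary, timings machine-dependent): exact GP regression has to factor an n x n kernel matrix, so doubling n multiplies the work by roughly 8.

```python
import time
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Pairwise squared distances via the dot-product identity; O(n^2) memory.
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

rng = np.random.default_rng(0)
for n in (1000, 2000, 4000):
    X, y = rng.normal(size=(n, 3)), rng.normal(size=n)
    K = rbf_kernel(X, X) + 1e-6 * np.eye(n)       # building/storing K is already O(n^2)
    t0 = time.perf_counter()
    L = np.linalg.cholesky(K)                      # the O(n^3) step
    alpha = np.linalg.solve(K, y)                  # weights for the posterior mean
    print(f"n={n}: {time.perf_counter() - t0:.2f} s")
```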

I think they already tried this in the original Transformer paper. The results weren't good enough to be worth implementing.

From the paper (where additive attention is the other "similarity function"):

Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
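
For anyone curious, a rough NumPy sketch of the two compatibility functions that quote contrasts (all shapes and weights invented): the dot-product scores are a single matrix multiplication, while the additive (Bahdanau-style) scores need a one-hidden-layer feed-forward pass for every query-key pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 6, 16, 32                        # tokens, model dim, hidden dim (made up)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Scaled dot-product scores: one GEMM, shape (n, n).
dot_scores = Q @ K.T / np.sqrt(d)

# Additive attention: score(q, k) = v^T tanh(W_q q + W_k k), i.e. a feed-forward
# network with a single hidden layer, evaluated for every query-key pair.
Wq, Wk, v = rng.normal(size=(d, h)), rng.normal(size=(d, h)), rng.normal(size=h)
hidden = np.tanh((Q @ Wq)[:, None, :] + (K @ Wk)[None, :, :])    # (n, n, h)
add_scores = hidden @ v                                           # (n, n)
```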