Comment by D-Machine

2 days ago

I think this mostly comes down to (multi-headed) scaled dot-product attention just being very easy to parallelize on GPUs. You can then make up for the (relative) lack of expressivity / flexibility by just stacking layers.
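To make the parallelism point concrete, here is a rough numpy sketch of multi-head scaled dot-product attention (single sequence, no masking or output projection; all names and shapes are just illustrative). The whole thing is batched matmuls plus a softmax, which is exactly the workload GPUs are optimized for:

```python
import numpy as np

def mhsa(x, Wq, Wk, Wv, n_heads):
    """Multi-head scaled dot-product attention as pure batched matmuls."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (n_heads, seq, d_head)
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores for all heads and positions at once
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (n_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # softmax over keys
    # Weighted sum of values, then re-merge the heads
    return (weights @ v).transpose(1, 0, 2).reshape(seq, d_model)
```

Every position and every head is handled in one shot; nothing in the forward pass is inherently sequential.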

A neural-GP could probably be trained with the same parallelization efficiency via a consistent discretization of the input space. I think their absence owes more to the fact that discrete data (namely, text) has dominated AI applications. I imagine that neural-GPs could be extremely useful for scale-free interpolation of continuous data (e.g. images), or for other non-autoregressive generative models (scale-free diffusion?).
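For what I mean by consistent discretization, here is the plain (non-neural) GP version of the idea, again just a sketch with made-up names: if every training example is observed on the same fixed grid, the kernel matrix and the associated solve are shared across the whole batch, and the per-example work collapses to a single matmul.

```python
import numpy as np

def rbf_gram(grid, lengthscale=0.1):
    """RBF kernel Gram matrix on a fixed 1-D grid of input locations."""
    d2 = (grid[:, None] - grid[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior_mean(grid, y_batch, noise=1e-2, lengthscale=0.1):
    """Posterior means for a batch of functions observed on a shared grid.

    y_batch: (batch, n_grid). Because the grid is shared, K and the solve
    below are computed once; the batch dimension only sees a matmul.
    """
    K = rbf_gram(grid, lengthscale)
    A = np.linalg.solve(K + noise * np.eye(len(grid)), K)  # (K + s^2 I)^-1 K
    return y_batch @ A  # (batch, n_grid), fully parallel over the batch

# toy usage
grid = np.linspace(0.0, 1.0, 64)
y = np.sin(2 * np.pi * grid)[None, :] + 0.05 * np.random.randn(8, 64)
means = gp_posterior_mean(grid, y)  # (8, 64)
```

A neural-GP would put learned components (kernel, mean, amortized inference) on top of this, but as long as the discretization is consistent it should inherit the same batched-matmul structure.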

  • Right, I think there are plenty of other approaches that surely scale just as easily or better. It's like you said, the (early) dominance of text data just artificially narrowed the approaches tried.