Comment by wongarsu

1 year ago

One big thing that bells and whistles do is limit the training space.

For example, when CNNs took over computer vision, it wasn't because they could do something dense networks couldn't. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly, transformers are great because they let us train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs much faster to train, they are actually pretty good. Training speed and efficiency remain the big bottleneck, not the expressiveness of the architecture.
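
A rough back-of-the-envelope sketch of the "removed edges" point: counting the parameters of one dense layer versus one conv layer on a small image. The sizes below are my own illustrative choices, not numbers from the comment.

```python
# Parameter count: fully connected layer vs. 3x3 conv layer on a 64x64 RGB input.
h, w, c = 64, 64, 3            # input image: 64x64, 3 channels
hidden = 1024                  # dense layer width

dense_params = (h * w * c) * hidden + hidden   # weights + biases, ~12.6M
conv_params = (3 * 3 * c) * 64 + 64            # 3x3 kernels, 64 out channels, ~1.8K

print(f"dense layer: {dense_params:,} params")
print(f"conv layer:  {conv_params:,} params")
```

The conv layer keeps only local, weight-shared connections, so the budget saved on edges can go into depth instead.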

This is true. This is why, in many of our experiments with a new algorithm, KESieve, we find the separating planes much faster than traditional deep learning training does. The premise: a neural network builds planes that separate the data and adjusts those planes through an iterative learning process. What if we could find a non-iterative method that draws the same planes? We have been trying this, and so far we have been able to replace most network layers with this approach. We haven't tried it on transformers yet.
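
KESieve itself isn't described in detail here, so the following is only a generic sketch of the idea of finding a separating plane in closed form (a single linear solve) rather than by iterative updates. The synthetic data, the ridge-style solve, and the gradient-descent baseline are all my own assumptions, not the actual algorithm.

```python
import numpy as np

# Two separable 2D blobs with +/-1 labels (toy data, purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])
Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column

# Non-iterative: solve (Xb^T Xb + lambda*I) w = Xb^T y in one shot.
w_closed = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(3), Xb.T @ y)

# Iterative baseline: plain gradient descent on squared loss, for contrast.
w_iter = np.zeros(3)
for _ in range(500):
    grad = Xb.T @ (Xb @ w_iter - y) / len(y)
    w_iter -= 0.1 * grad

acc = lambda w: np.mean(np.sign(Xb @ w) == y)
print(f"closed-form plane accuracy: {acc(w_closed):.2f}")
print(f"iterative plane accuracy:   {acc(w_iter):.2f}")
```

Both routes end up with essentially the same plane; the difference is that the first needs no training loop, which is the kind of trade the comment is pointing at.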

Some links if interested:

[1] https://gpt3experiments.substack.com/p/understanding-neural-...

[2] https://gpt3experiments.substack.com/p/building-a-vector-dat...