Comment by menaerus

3 months ago

What line of thinking are you referring to?

Transformers were aimed at solving the "context" problem, and the authors, aware that RNNs don't scale at all and don't solve that particular problem either, had to come up with an algorithm that overcomes both of those issues. It turned out that the compute scalability of self-attention was the crucial ingredient for solving the problem, something RNNs were totally incapable of.

They designed the algorithm to run on the hardware available at the time, but the hardware developed afterwards was a direct consequence, or as I called it a byproduct, of transformers proving themselves able to scale continuously. Had that not been true, we wouldn't have all those iterations of NVidia chips.

So, although one could say that the NVidia chip design is what enabled the transformers' success, one could also say that we wouldn't have those chips if transformers hadn't proven themselves to be so damn efficient. And I'm inclined to think the latter.

> This is not "just" machine learning, because we have never been able to do the things we can do today, and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

This is the line of thinking I'm referring to.

The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used with RNNs. They certainly improved it, but the innovation of the architecture was making it computationally efficient to train.
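To make the "computationally efficient to train" part concrete: an RNN has to walk through a sequence one step at a time, because each hidden state depends on the previous one, while self-attention over the whole sequence is just a few dense matmuls that a GPU can run for every position at once. A rough numpy sketch (shapes, names, and toy dimensions are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.standard_normal((seq_len, d_model))   # one toy input sequence

# RNN: each hidden state depends on the previous one, so the time
# dimension must be processed sequentially.
W_h = 0.1 * rng.standard_normal((d_model, d_model))
W_x = 0.1 * rng.standard_normal((d_model, d_model))
h = np.zeros(d_model)
rnn_states = []
for t in range(seq_len):                      # unavoidable sequential loop
    h = np.tanh(h @ W_h + x[t] @ W_x)
    rnn_states.append(h)

# Self-attention: every position attends to every other position via a
# handful of dense matmuls, with no dependency along the time axis.
W_q = 0.1 * rng.standard_normal((d_model, d_model))
W_k = 0.1 * rng.standard_normal((d_model, d_model))
W_v = 0.1 * rng.standard_normal((d_model, d_model))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)                      # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
attn_out = weights @ V                                    # (seq_len, d_model)
```

The loop in the first half is the part that cannot be parallelized across time steps; the second half has no such dependency, which is what made it so much cheaper to scale training up.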

I'm not really following your argument. You're clearly acknowledging that it was first the case that, with the hardware of the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.

All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.

> this is not only the result of better hardware

Regarding this in particular: the majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.
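For readers who haven't gone through them, the scaling-law papers (Kaplan et al. 2020, Hoffmann et al. 2022) fit the loss as a power law in parameter count N and training tokens D. A toy sketch of that functional form, with placeholder constants that are only loosely in the range those papers discuss, not the published fits:

```python
# Illustrative scaling-law shape: loss = irreducible term + power-law terms
# in parameters N and training tokens D. All constants are placeholders.
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

print(predicted_loss(N=1.5e9, D=2e10))   # a smaller model on a smaller dataset
print(predicted_loss(N=1e12, D=1e13))    # orders of magnitude more parameters and tokens
```

The whole point of those fits is the claim above: pushing N and D, i.e. spending more compute, keeps lowering the predicted loss.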

  • > innovation of the architecture was making it computationally efficient to train.

    and

    > researchers demonstrated that simply scaling up training with more data yielded better models

    and

    > The fact that hardware was then optimized for these architectures only reinforces this point.

    and

    > All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.

    is what I am saying as well. I have read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that readers who are lighter on the topic can pick up the idea more easily.

    > The majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.

    Ok, I see your point, and the conclusion is where we disagree. You say that the innovation was simply enabled by better hardware, whereas I say that the better hardware wouldn't have had its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation was driven by the NVidia chips.

    I guess my point, put simply, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting that algorithm. With transformers, that has definitely not been the case. There have been plenty of optimizations over the years to better utilize the HW (e.g. flash-attention), but the architecture of transformers has remained more or less the same.
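    A concrete example of that last point: flash-attention doesn't change what attention computes, it only reorganizes the computation into blocks with a running ("online") softmax so the full score matrix never has to be materialized. A simplified numpy sketch of that idea (block size and names are mine; the real kernel does the blocking on-chip):

    ```python
    import numpy as np

    def naive_attention(Q, K, V):
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    def tiled_attention(Q, K, V, block=4):
        # Process keys/values block by block, keeping a running max and a
        # running softmax denominator, so the (seq x seq) score matrix is
        # never built in full.
        d, n = Q.shape[-1], K.shape[0]
        m = np.full((Q.shape[0], 1), -np.inf)         # running max of scores
        l = np.zeros((Q.shape[0], 1))                 # running denominator
        acc = np.zeros((Q.shape[0], V.shape[1]))      # running weighted sum of V
        for start in range(0, n, block):
            Kb, Vb = K[start:start+block], V[start:start+block]
            s = Q @ Kb.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            p = np.exp(s - m_new)
            rescale = np.exp(m - m_new)               # correct earlier blocks
            l = l * rescale + p.sum(axis=-1, keepdims=True)
            acc = acc * rescale + p @ Vb
            m = m_new
        return acc / l

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
    print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
    ```

    Same outputs, same architecture; it is just a much better fit for the memory hierarchy of the hardware.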