
Comment by mitchelld

3 months ago

This line of thinking doesn't really correspond to the reason Transformers were developed in the first place, which was to better utilize how GPUs do computation. RNNs were too slow to train at scale because the time steps had to be computed sequentially, whereas Transformers (with masking) can run the whole input through in a single pass.
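As a minimal sketch of that difference (toy sizes and random weights, purely illustrative, not the actual models): the RNN loop below cannot be parallelized across time steps, while the causally masked attention produces every position's output in one batched matrix computation.

```python
import numpy as np

# Toy comparison, purely illustrative: an RNN must walk the sequence step by
# step, while causally masked self-attention produces every position's output
# in one batched matrix computation. Sizes and weights are made up.
T, d = 6, 8                                  # sequence length, hidden size
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))              # input embeddings for one sequence

# RNN-style: T sequential steps, each depending on the previous hidden state.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
rnn_out = []
for t in range(T):                           # cannot be parallelized across t
    h = np.tanh(W_h @ h + W_x @ x[t])
    rnn_out.append(h)

# Attention-style: queries, keys, values for all positions at once, with a
# lower-triangular (causal) mask so position t only attends to positions <= t.
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf    # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ V                       # all T outputs in a single pass
```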

It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously we now have models with hundreds of billions or trillions of parameters. The ability to train these models is a direct result of better and more hardware being applied to the problem, as well as of the Transformer architecture being specifically designed to map well onto parallel computation at scale.

The first GPT model came out ~8 years ago. I recall that when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for; looking back now, that's kind of amusing. Fundamentally, though, all of these models are the same setup as what was used then: decoder-based Transformers. They are just substantially larger, trained on substantially more data, with substantially more hardware.

What line of thinking are you referring to?

Transformers were aimed at solving the "context" problem, and the authors, aware that RNNs don't scale and don't solve that particular problem either, had to come up with an algorithm that overcomes both of those issues. It turned out that self-attention's ability to scale with compute was the crucial ingredient for solving the problem, something RNNs were totally incapable of.

They designed the algorithm to run on the hardware available at the time, but the hardware developed afterwards was a direct consequence, or as I called it a byproduct, of Transformers proving themselves able to keep scaling. Had that not been true, we wouldn't have all those iterations of NVidia chips.

So, although one could say that the NVidia chip design is what enabled the Transformers' success, one could also say that we wouldn't have those chips if Transformers hadn't proven themselves to be so damn efficient. And I'm inclined toward the latter.

  • > This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

    This is the line of thinking I'm referring to.

    The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used on RNNs. They certainly improved it, but innovation of the architecture was making it computation efficient to train.

    I'm not really following your argument. Clearly you're acknowledging that it was first the case that, with the hardware of the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.

    All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.

    > this is not only the result of better hardware

    Regarding this in particular: a majority of the improvement from GPT-2 to GPT-4 was simply training on a much larger scale. That was enabled by better hardware, and lots of it.
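    As a loose illustration of what those scaling-law papers report, this sketch uses the power-law functional form fit in e.g. Kaplan et al. (2020) and Hoffmann et al. (2022); the constants below are round placeholders I picked for illustration, not the published fits:

    ```python
    # Scaling-law-style loss curve: a power law in parameter count N and
    # training tokens D. The functional form follows the papers; the
    # constants are illustrative placeholders only.
    def fitted_loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
        return E + A / N**alpha + B / D**beta

    # More parameters and more data keep lowering the predicted loss,
    # with diminishing returns:
    for N, D in [(1e9, 1e10), (1e10, 1e11), (1e11, 1e12)]:
        print(f"N={N:.0e}, D={D:.0e} -> predicted loss ~ {fitted_loss(N, D):.2f}")
    ```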

    • > the innovation of the architecture was making it computationally efficient to train.

      and

      > researchers demonstrated that simply scaling up training with more data yielded better models

      and

      > The fact that hardware was then optimized for these architectures only reinforces this point.

      and

      > All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.

      is what I am saying as well. I have read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that readers who are lighter on the topic can pick up the idea more easily.

      > A majority of the improvement from GPT-2 to GPT-4 was simply training on a much larger scale. That was enabled by better hardware, and lots of it.

      Ok, I see your point, and the conclusion here is what we disagree on. You say that the innovation was simply enabled by better hardware, whereas I say that that better hardware wouldn't have its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation is driven by the NVidia chips.

      I guess my point, put simply, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting the algorithm. And with Transformers, that has definitely not been the case. There have been plenty of optimizations throughout the years to better utilize the HW (e.g. flash-attention), but the architecture of Transformers has remained more or less the same.
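      To make the flash-attention example concrete, here is my own toy reconstruction of the core idea behind such kernels (not the actual implementation): compute the softmax-weighted sum in tiles with a running max and normalizer, so the full attention score matrix never has to be materialized in slow memory, while the math, and therefore the architecture, stays exactly the same.

      ```python
      import numpy as np

      # Toy, single-query sketch of the online-softmax trick used by kernels
      # like flash-attention; this is a reconstruction for illustration, not
      # the real implementation.
      def streaming_attention_row(q, K, V, tile=4):
          m, l = -np.inf, 0.0                     # running max and normalizer
          acc = np.zeros(V.shape[1])
          for start in range(0, K.shape[0], tile):
              k_t, v_t = K[start:start + tile], V[start:start + tile]
              s = k_t @ q / np.sqrt(q.shape[0])   # scores for this tile only
              m_new = max(m, s.max())
              scale = np.exp(m - m_new)           # rescale previous partials
              p = np.exp(s - m_new)
              acc = acc * scale + p @ v_t
              l = l * scale + p.sum()
              m = m_new
          return acc / l

      # Same result as the straightforward full-matrix softmax attention:
      rng = np.random.default_rng(0)
      T, d = 16, 8
      q = rng.standard_normal(d)
      K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
      scores = K @ q / np.sqrt(d)
      w = np.exp(scores - scores.max())
      w /= w.sum()
      assert np.allclose(streaming_attention_row(q, K, V), w @ V)
      ```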