Comment by menaerus
2 months ago
> innovation of the architecture was making it computationally efficient to train.
and
> researchers demonstrated that simply scaling up training with more data yielded better models
and
> The fact that hardware was then optimized for these architectures only reinforces this point.
and
> All the papers discussing scaling laws point to the same thing: simply using more compute and data yields better results.
is what I am saying as well. I have read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that readers who are light on the topic can pick up the idea more easily (see the sketch below).
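For anyone light on the topic, the condensed form of those scaling-law results is a power law: held-out loss falls roughly as L(C) ∝ C^(−α) in training compute C. A toy sketch of that shape; the constants here are made up for illustration, not taken from any paper:

```python
# Hypothetical power-law fit of the form L(C) = a * C**(-alpha).
# The constants a and alpha below are illustrative placeholders, not
# fitted values from Kaplan et al. or any other scaling-law paper.
def loss_from_compute(c, a=10.0, alpha=0.05):
    """Predicted loss as a power law in training compute c (FLOPs)."""
    return a * c ** (-alpha)

for c in (1e18, 1e20, 1e22):  # three compute budgets, in FLOPs
    print(f"C = {c:.0e} FLOPs -> predicted loss {loss_from_compute(c):.3f}")
```

The point the quoted papers make is just this monotone shape: more compute, lower loss, with no change to the algorithm itself.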
> A majority of the improvement from GPT-2 to GPT-4 was simply training on a much larger scale. That was enabled by better hardware and lots of it.
Ok, I see your point, and the conclusion is where we disagree. You say the innovation was simply enabled by better hardware, whereas I say that better hardware wouldn't have had its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation is driven by the NVidia chips.
I guess my point, put simply, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting the algorithm. And with transformers, that has definitely not been the case. There have been plenty of optimizations over the years to better utilize the hardware (e.g. flash-attention), but the transformer architecture itself has remained more or less the same.
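To make that last point concrete, here is a minimal sketch (my own illustration, not from the thread) showing that a flash-attention-style fused kernel computes the same function as the original transformer attention, softmax(QKᵀ/√d_k)V; only the implementation is reorganized for the hardware. It assumes PyTorch 2.x, where `torch.nn.functional.scaled_dot_product_attention` dispatches to fused kernels when available:

```python
import torch
import torch.nn.functional as F

def reference_attention(q, k, v):
    # The attention from "Attention Is All You Need":
    # softmax(Q K^T / sqrt(d_k)) V, materializing the full score matrix.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Same mathematical function, hardware-optimized implementation:
# PyTorch may dispatch this to a fused, flash-attention-style kernel
# that never materializes the full score matrix.
q = k = v = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
out_ref = reference_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(out_ref, out_fused, atol=1e-5)
```

The assertion passing is the whole argument in miniature: the algorithm is unchanged, and the hardware-aware rewrite only changes how it is executed.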