Comment by drodgers

1 year ago

> The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape

I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances in how curves are architected and fit to the dataset (including applying more computing power).

Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
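To put rough numbers on that analogy, here's a back-of-envelope sketch (my own illustration; the counts are the usual expected-comparison estimates, ~n·n! for bogosort and ~n·log2(n) for quicksort):

```python
import math

# Expected work for bogosort (~n * n! comparisons: ~n! shuffles, each checked
# in ~n steps) versus quicksort (~n * log2(n) comparisons on average).
# "Just an optimisation", but the gap is what makes one of them usable at all.
for n in (10, 15, 20):
    bogo = n * math.factorial(n)
    quick = n * math.log2(n)
    print(f"n={n}: bogosort ~{bogo:.2e} comparisons, quicksort ~{quick:.0f}")
```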

When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.

I see what you are saying, and I made a similar comment.

However, it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).

Naively, you wouldn't expect, eg, 'x -> a * x + b' and 'x -> a * sin x + b' to fit the same data about equally well. But that's an observation from low dimensions. It seems that once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
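As a toy illustration of that (my own sketch, with an arbitrary target curve; the exact errors will vary): two different two-parameter models fit the same data quite differently, while two different many-parameter families both fit it about equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.tanh(x) + 0.05 * rng.normal(size=x.shape)  # arbitrary target data

def rmse_of_least_squares_fit(features):
    """Least-squares fit of y on a fixed feature matrix; returns the RMSE."""
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    return np.sqrt(np.mean((features @ coeffs - y) ** 2))

ones = np.ones_like(x)

# Two-parameter models: 'a*x + b' and 'a*sin(x) + b' fit the data differently.
print("a*x + b      :", rmse_of_least_squares_fit(np.column_stack([x, ones])))
print("a*sin(x) + b :", rmse_of_least_squares_fit(np.column_stack([np.sin(x), ones])))

# Many-parameter families: a polynomial basis and a random-sine basis
# both get close to the noise floor, despite having very different 'shapes'.
poly = np.column_stack([x**k for k in range(15)])
rand_sines = np.column_stack([np.sin(f * x) for f in rng.normal(size=15)] + [ones])
print("15-term poly :", rmse_of_least_squares_fit(poly))
print("16-term sines:", rmse_of_least_squares_fit(rand_sines))
```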

I'm faintly reminded of the Church-Turing Thesis; the differences between computing architectures are both 'real' and 'just an optimisation'.

> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.

You are right: these normalisation techniques help you economise on training data, not just on compute. Some of them can be applied independently of the model, eg augmenting your training data with noise. But others are very model-dependent.
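A minimal sketch of the model-independent kind (the function name and the noise scale are just placeholders I made up):

```python
import numpy as np

def augment_with_noise(X, y, copies=4, noise_scale=0.05, seed=0):
    """Enlarge the training set with jittered copies of each example.

    Nothing here depends on the model that will later be trained on (X, y)."""
    rng = np.random.default_rng(seed)
    X_noisy = [X] + [X + noise_scale * rng.normal(size=X.shape) for _ in range(copies)]
    y_repeated = [y] * (copies + 1)
    return np.concatenate(X_noisy), np.concatenate(y_repeated)
```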

I'm not sure how the 'agentic' approaches fit here.

  • > Naively, you wouldn't expect

    I, a naïf, expected this.

    Is multiplication versus sine in the analogy hiding it, perhaps?

    I've always pictured it as just "needing to learn" the function terms, with the function's guts being an abstraction that gets learned.

    Might just be because I'm a physics dropout with a bunch of whacky half-remembered probably-wrong stuff about how any function can be approximated by eg Fourier series.

    • So (most) neural nets can be seen as a function of a _fixed_ form with some inputs and lots and lots of parameters.

      In my example, a and b were the parameters. The kinds of data you can approximate well with a simple sine wave and the kinds of data you can approximate with a straight line are rather different.

      Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.

      > [...] about how any function can be approximated by eg Fourier series.

      Fourier series are an interesting example to bring up! I think I see what you mean.

      In theory they work well to approximate any function over either a periodic domain or some finite interval. But unless you take special care, Fourier analysis is extremely sensitive to errors in the phase parameters.

      (Special care could, eg, mean hacking up your input domain into 'boxes'. That works well for eg audio or video compression, but gives up on any model generalisation between 'boxes', especially about what would happen in a later box.)

      Another interesting example is Taylor series. For many simple functions Taylor series are great, but for even moderately complicated ones you need to be careful. See eg how the Taylor series for the logarithm around x=1 works well, but if you tried it around x=0, you'd be in for a bad time. (There's a small sketch at the end of this comment.)

      The interesting observation isn't just that there are multiple universal approximators, but that at high enough parameter counts they seem to approximate about equally well in practice (while differing in how easily they can be trained).
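      A toy sketch of the Taylor-series point above (my own example; the number of terms is arbitrary): the series for ln(x) around x = 1 only converges for 0 < x ≤ 2, so the partial sums behave well at x = 1.9 and fall apart at x = 2.5.

      ```python
      import math

      def log_taylor_at_1(x, terms=30):
          """Partial sum of the Taylor series of ln(x) around x = 1:
          ln(x) = sum_{k>=1} (-1)**(k+1) * (x - 1)**k / k, valid only for 0 < x <= 2."""
          u = x - 1.0
          return sum((-1) ** (k + 1) * u ** k / k for k in range(1, terms + 1))

      for x in (1.5, 1.9, 2.5):
          print(f"x={x}: series={log_taylor_at_1(x):+.4f}  ln(x)={math.log(x):+.4f}")
      ```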


  • This reminds me of control systems theory where provided there's feedback, the forward transfer function doesn't matter beyond very basic properties around the origin.
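    A toy numeric version of that point (my own sketch, looking only at the DC gain of a unity-feedback loop): with a large loop gain K, the closed-loop gain K·G/(1 + K·G) is close to 1 for very different plants G, so the forward path barely matters.

    ```python
    # Unity-feedback closed loop at DC (s = 0): y/r = K*G0 / (1 + K*G0).
    # With a large loop gain K, the result barely depends on the plant's DC gain G0
    # (assuming G0 > 0 and that the loop is stable in the first place).
    def closed_loop_dc_gain(G0, K=1000.0):
        return K * G0 / (1 + K * G0)

    for G0 in (0.5, 2.0, 10.0):  # very different forward transfer functions at DC
        print(G0, round(closed_loop_dc_gain(G0), 4))
    ```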

Wait! We certainly did NOT have huge datasets (like the current internet) for ages. Not even decades. I've seen a lecture by an MIT professor (which I cannot find now) where he asserted categorically that the advances in AI are mostly because of the huge data that we now have and didn't before. And that was an old video.

  • Whichever sense it's true in, it's not true in the sense that, eg, you can in theory approximate any curve with a single-layer neural net: you're not actually going to be able to do that for problems CNNs or transformers work decently on (toy sketch at the end of this comment). And Google indexed all of the public Internet well before its researchers came up with transformers.

    Another way to look at it: as you say, it was an old video, but there has been progress since, even though by its own definition we already had large datasets when it came out.
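    As a toy sketch of the "in theory" half (my own example, random-feature style so the fit is just least squares; the exact error varies with the seed): a single hidden layer really can approximate an arbitrary 1D curve, but nothing about this brute-force route scales to the problems CNNs or transformers handle.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-3, 3, 400)[:, None]
    y = np.sin(3 * x[:, 0]) * np.exp(-x[:, 0] ** 2 / 4)  # arbitrary 1D target

    hidden = 200                                    # width of the single hidden layer
    W, b = rng.normal(size=(1, hidden)), rng.normal(size=hidden)
    H = np.tanh(x @ W + b)                          # fixed random hidden features
    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit only the output weights
    print("RMSE:", np.sqrt(np.mean((H @ w_out - y) ** 2)))
    ```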

I think by far the biggest advances are related to compute power. The amount of processing needed to run training algorithms on the amounts of data behind the latest models just wasn't available even five years ago, and definitely not ten years ago.

I'm sure there are optimizations from the model shape as well, but I don't think that running the best algorithms we have today with hardware from five-ten years ago would have worked in any reasonable amount of time/money.

  • A 30bn param model, hell even a 7bn param model, is still incredibly useful and I feel like that could have been doable a decade ago!

    We have GPT-4 (or at least 3.5) tier performance in these much smaller models now. If we teleported back in time, it may have been possible to build them.

    • I think the size of the model is only one part of it. They're still training these 7bn parameter models on the whole data set, and just crunching through that takes enormous compute that people didn't have at the current price points until now.

      I should also mention that the idea itself of using GPUs for compute and then specifically for AI training was an innovation. And the idea that simply scaling up was going to be worth the investment is another major innovation. It's not just the existence of the compute power, it's the application to NN training tasks that got us here.

      Here[0] is an older OpenAI post about this very topic. They estimate that between 2012 and 2018, the compute used to train the SotA models of the day increased by roughly 300,000x, doubling every ~3.4 months.

      [0] https://openai.com/index/ai-and-compute/

Isn't the bogosort here the transformer, and the quicksort the proposed modified RNN (175 times faster training for 500-length sequences)?