Comment by xnx
1 year ago
It's a curse and a blessing that discussion of topics happens in so many different places. I found this comment on Twitter/X interesting: https://x.com/fchollet/status/1841902521717293273
"Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
> The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape
I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances in how curves are architected and fit to the dataset (including applying more computing power).
Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
I see what you are saying, and I made a similar comment.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' to fit the same data as 'x -> a * sin x + b' about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
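To make that concrete, here's a rough numpy sketch (made-up data, two-parameter fits) of how much the functional form matters in low dimensions:

    import numpy as np

    # Made-up 1-D data that follows neither family exactly.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 6, 200)
    y = np.exp(-0.3 * x) + 0.05 * rng.standard_normal(x.size)

    def mse(design):
        # Least-squares fit of y against the given basis columns.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.mean((y - design @ coef) ** 2)

    linear = np.column_stack([x, np.ones_like(x)])            # a * x + b
    sinusoid = np.column_stack([np.sin(x), np.ones_like(x)])  # a * sin(x) + b

    print("MSE for a*x+b:     ", mse(linear))
    print("MSE for a*sin(x)+b:", mse(sinusoid))

With only two parameters each, the two families leave very different residuals; pile enough parameters onto either one and both would drive the error toward zero on the same data.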
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right, these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independent of the model, eg augmenting your training data with noise. But some others are very model dependent.
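The model-independent kind can be as simple as this (rough sketch, hypothetical feature matrix):

    import numpy as np

    def augment_with_noise(X, copies=4, sigma=0.01, seed=0):
        # Model-independent augmentation: add jittered copies of each sample.
        rng = np.random.default_rng(seed)
        noisy = [X + sigma * rng.standard_normal(X.shape) for _ in range(copies)]
        return np.concatenate([X, *noisy], axis=0)

    X = np.random.rand(100, 16)          # hypothetical feature matrix
    print(augment_with_noise(X).shape)   # (500, 16): original plus 4 noisy copies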
I'm not sure how the 'agentic' approaches fit here.
> Naively, you wouldn't expect
I, a naïf, expected this.
Is multiplication versus sine in the analogy hiding it, perhaps?
I've always pictured it as just "needing to learn" the function terms and the function guts are an abstraction that is learned.
Might just be because I'm a physics dropout with a bunch of whacky half-remembered probably-wrong stuff about how any function can be approximated by e.g. Fourier series.
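The half-remembered bit is roughly right, though. A quick numpy sketch of partial Fourier sums closing in on a square wave:

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 1000)
    square = np.sign(np.sin(x))

    def partial_sum(n_terms):
        # Square-wave expansion: (4/pi) * sum over odd k of sin(k*x)/k.
        k = np.arange(1, 2 * n_terms, 2)
        return (4 / np.pi) * np.sum(np.sin(np.outer(k, x)) / k[:, None], axis=0)

    for n in (1, 5, 50):
        err = np.mean(np.abs(partial_sum(n) - square))
        print(f"{n:3d} harmonics: mean abs error {err:.3f}")

The mean error keeps shrinking as harmonics are added (the Gibbs overshoot at the jumps never fully goes away, but convergence holds in the mean).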
8 replies →
This reminds me of control systems theory where provided there's feedback, the forward transfer function doesn't matter beyond very basic properties around the origin.
Wait! We certainly did NOT have huge datasets (like the current internet) for ages. Not even decades. I've seen a lecture by an MIT professor (which I cannot find now) where he asserted categorically that the advances in AI are mostly because of the huge data that we now have and didn't have before. And that was an old video.
Whichever sense it's true in, it's not true in the sense that, eg, you can approximate any curve with a single-layer neural net; you're not actually going to be able to do that for problems CNNs or transformers work decently on. And Google indexed all of the public Internet way before its researchers came up with transformers.
Another way to look at it: as you say, it was an old video, but there has been progress since, even though by its own definition we already had large datasets when it came out.
I think by far the biggest advances are related to compute power. The amount of processing needed to run training algorithms on the amounts of data needed for the latest models was just not possible even five years ago, and definitely not ten years ago.
I'm sure there are optimizations from the model shape as well, but I don't think that running the best algorithms we have today with hardware from five to ten years ago would have worked in any reasonable amount of time/money.
A 30bn param model, hell even a 7bn param model, is still incredibly useful and I feel like that could have been doable a decade ago!
We have GPT-4 (or at least 3.5) tier performance in these much smaller models now. If we teleported back in time, it might have been possible to build one.
1 reply →
Isn't the transformer the bogosort here, and the proposed modified RNN (175x faster training at sequence length 512) the quicksort?
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512)
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
> Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
This! Not just the fastest, but with the lowest total resource use.
Fully connected neural networks are universal function approximators. Technically we don't need anything but an FNN, but memory requirements and speed would be abysmal, far beyond the realm of practicality.
Unless we could build chips in 3D?
9 replies →
> finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale
Not to him; he runs the ARC challenge. He wants a new approach entirely. Something capable of few-shot learning on out-of-distribution patterns... somehow.
One big thing that bells and whistles do is limit the training space.
For example, when CNNs took over computer vision, that wasn't because they were doing something that dense networks couldn't do. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly, transformers are great because they allow us to train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs a lot faster to train, they are actually pretty good. Training speed and efficiency remain the big bottleneck, not the actual expressiveness of the architecture.
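Rough back-of-the-envelope (hypothetical layer sizes) of how many edges the convolution throws away:

    # Dense layer connecting every input pixel to every output unit vs. a conv
    # layer sharing one small kernel across all positions (hypothetical sizes).
    H, W, C_in, C_out, k = 224, 224, 3, 64, 3

    dense_weights = (H * W * C_in) * (H * W * C_out)  # every edge gets its own weight
    conv_weights = (k * k * C_in) * C_out             # one shared 3x3 kernel per output channel

    print(f"dense: {dense_weights:,}")  # ~483 billion weights
    print(f"conv:  {conv_weights:,}")   # 1,728 weights

Same input, same output resolution; the conv just refuses to learn the edges that almost certainly don't matter, and that budget goes into depth instead.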
This is true. This is the reason that, in many of our experiments, we find that using a new algorithm, KESieve, we actually find the planes much faster than traditional deep learning training approaches. The premise is: a neural network builds planes that separate the data and adjusts these planes through an iterative learning process. What if we could find a non-iterative method that draws these same planes? We have been trying this, and so far we have been able to replace most network layers using this approach. Haven't tried it for transformers yet, though.
Some links if interested:
[1] https://gpt3experiments.substack.com/p/understanding-neural-...
[2] https://gpt3experiments.substack.com/p/building-a-vector-dat...
I figured this was pretty obvious given that MLPs are universal function approximators. A giant MLP could achieve the same results as a transformer. The problem is the scale - we can’t train a big enough MLP. Transformers are a performance optimization, and that’s why they’re useful.
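A rough sense of the scale problem (hypothetical sizes, illustration only): feed a whole flattened context window into a plain dense layer and the weight count explodes compared to an attention block, whose parameters don't grow with sequence length.

    # Hypothetical sizes for illustration only.
    seq_len, d_model = 2048, 4096

    mlp_dense_layer = (seq_len * d_model) ** 2   # flatten the context, then one dense layer
    attention_block = 4 * d_model * d_model      # Q, K, V and output projections

    print(f"dense over flattened context: {mlp_dense_layer:,}")  # ~7e13 weights
    print(f"one attention block:          {attention_block:,}")  # ~6.7e7 weights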
What it will come down to is computational efficiencies. We don’t want to retrain once a month - we want to retrain continuously. We don’t want one agent talking to 5 LLMs. We want thousands of LLMs all working in concert.
This, and also the way models are trained has to be rethought. Backprop is good for figuring out complex function mappings, but not for storing information.
Sounds like something that has unsustainable energy costs.
I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
On the other hand, while "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." is true, a sufficiently expressive mechanism may not be computationally or memory efficient. As both are constraints on what you can actually build, it's not whether the architecture can produce the result, but whether a feasible/practical instantiation of that architecture can produce the result.
> I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
You may be referring to Aidan Gomez (CEO of Cohere and contributor to the transformer architecture) during his Machine Learning Street Talk podcast interview. I agree; if as much attention had been put towards the RNN during the initial transformer hype, we may very well have seen these advancements earlier.
> is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
(Somewhat) fun and (somewhat) related fact: there's a whole cottage industry of "is all you need" papers https://arxiv.org/search/?query=%22is+all+you+need%22&search...
Reminds me of the "Considered Harmful" articles:
https://meyerweb.com/eric/comment/chech.html
Quick, somebody write “All You Need Considered Harmful” and “Considered Harmful Is All You Need.”
Which seems closer to true?
1 reply →
I wonder if there's something about tech culture - or tech people - that encourages them to really, really like snowclones.
2 replies →
Starting of course with the classic paper from Lennon and McCartney, 1967.
Architecture matters because while deep learning can conceivably fit a curve with a single, huge layer (in theory... Universal approximation theorem), the amount of compute and data needed to get there is prohibitive. Having a good architecture means the theoretical possibility of deep learning finding the right N dimensional curve becomes a practical reality.
Another thing about the architecture is we inherently bias it with the way we structure the data. For instance, take a dataset of (car) traffic patterns. If you only track the date as a feature, you miss that some events follow not just the day-of-year pattern but also holiday patterns. You could learn this with deep learning with enough data, but if we bake it into the dataset, you can build a model on it _much_ simpler and faster.
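Something like this (pandas sketch, made-up dataset and holiday list):

    import pandas as pd

    # Baking the holiday signal into the features instead of hoping the
    # model rediscovers it from raw dates.
    df = pd.DataFrame({
        "date": pd.to_datetime(["2024-07-03", "2024-07-04", "2024-07-05"]),
        "traffic_volume": [41000, 18500, 39000],
    })
    holidays = {pd.Timestamp("2024-07-04")}   # assumed holiday calendar

    df["day_of_year"] = df["date"].dt.dayofyear
    df["is_holiday"] = df["date"].isin(holidays).astype(int)
    print(df)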
So, architecture matters. Data/feature representation matters.
> can conceivably fit a curve with a single, huge layer
I think you need a hidden layer. I’ve never seen a universal approximation theorem for a single layer network.
I second that thought. There is a pretty well-cited paper from the late eighties called "Multilayer Feedforward Networks are Universal Approximators". It shows that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function. For non-continuous functions, additional layers are needed.
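A cheap way to see the flavour of that result (random hidden weights, only the output layer fitted by least squares; a sketch, not the actual theorem):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 400)[:, None]
    target = np.sin(2 * x).ravel() + 0.3 * x.ravel() ** 2   # some continuous target

    for width in (10, 100, 1000):
        W = 3 * rng.standard_normal((1, width))       # random hidden weights
        b = rng.uniform(-3, 3, width)
        hidden = np.maximum(0.0, x @ W + b)            # one ReLU hidden layer
        coef, *_ = np.linalg.lstsq(hidden, target, rcond=None)
        mse = np.mean((hidden @ coef - target) ** 2)
        print(f"width {width:5d}: MSE {mse:.5f}")

The fit error keeps falling as the hidden layer widens; the theorem says you can push it as low as you like for any continuous function on a bounded domain.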
Minsky and Papert showed that single layer perceptrons suffer from exponentially bad scaling to reach a certain accuracy for certain problems.
Multi-layer substantially changes the scaling.
Well, you also need an approach to 'curve fitting' where it's actually computationally feasible to fit the curve. The approach of mixing layers of matrix multiplication with a simple non-linearity like max(0, x) (ReLU) works really well for that. Earlier on they tried more complicated non-linearities, like sigmoids; or you could try an arbitrary curve that's not split into layers at all, but you would probably find it harder. (Though I'm fairly sure in the end you might end up in the same place, just after lots more computation spent on fitting.)
If you spent some time actually training networks, you'd know that's not true; that's why batch norm, dropout, and regularization are so successful. They don't increase the network's capacity (parameter count), but they increase its ability to learn.
Well, yes, but actually no, I guess: the transformers' benefit at the time was that they were more stable while learning, enabling larger and larger networks and datasets to be trained.
Inductive bias matters. A lot.
I mean, transformer-based LLMs are RNNs, just really really really big ones with very wide inputs that maintain large amounts of context.
No. An RNN has an arbitrarily-long path from old inputs to new outputs, even if in practice it can't exploit that path. Transformers have fixed-size input windows.
A chunk of the output still goes into the transformer input, so the arbitrarily-long path still exists; it just goes through a decoding/encoding step.
No, you can give as much context to a transformer as you want; you just run out of memory.
2 replies →
You can't have a fixed state and an arbitrarily long path from the input. Well, you can, but then it's meaningless, because you fundamentally cannot keep stuffing information of arbitrary length into a fixed state. RNNs effectively have fixed-size input windows.
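In code, the bottleneck is easy to see (toy numpy cell, made-up sizes):

    import numpy as np

    # A minimal recurrent cell: however long the sequence, everything has to
    # pass through the same fixed-size hidden state.
    d_in, d_hidden = 8, 16
    rng = np.random.default_rng(0)
    W_x = 0.1 * rng.standard_normal((d_in, d_hidden))
    W_h = 0.1 * rng.standard_normal((d_hidden, d_hidden))

    def run(sequence):
        h = np.zeros(d_hidden)             # fixed-size state
        for x_t in sequence:               # arbitrary length
            h = np.tanh(x_t @ W_x + h @ W_h)
        return h

    print(run(rng.standard_normal((10, d_in))).shape)      # (16,)
    print(run(rng.standard_normal((10_000, d_in))).shape)  # still (16,)

The path from the first input to the last output exists, but 10 or 10,000 steps of history all get squeezed into the same 16 numbers.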
3 replies →
[flagged]
One reason why I'm excited about o1 is that it seems like OpenAI have cracked the nut of effective RL during training time, which takes us out of the domain of just fitting to the curve of "what a human would have said next." I just finished writing a couple blog posts about this; the first [1] covers some problems with that approach and the second [2] talks about what alternatives might look like.
[1] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t... [2] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t...
> After reading this paper, I am now
Is this your paper?
Author: @fandzomga Username: fsndz
Why try to funnel us to your paywalled article?
I would like to read it, but it's under a paywall.
https://archive.is/nGaiU
paper is paywalled; just logging into Medium won't do it
sorry for the paywall, you can read the free version here: https://www.lycee.ai/blog/why-no-agi-openai
TLDR: “statistically fitting token output is not the same as human intelligence, and human intelligence and AGI are contradictory anyways (because humans make mistakes)”
Saved you the paywall click to the poorly structured medium article :)
Chollet is just a philosopher. He also thinks that Keras and TensorFlow are important, when nobody uses those. And he has published false stats about their usage.
Most LLMs aren't even using a "curve" yet at all, right? All they're using is a series of linear equations because the model weights are a simple multiply and add (i.e. basic NN Perceptron). Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
I think future NNs will probably be more adaptive than this, perhaps with some perceptrons using sine-wave functions or other kinds of math functions beyond just the linear "y=mx+b".
It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
The "squashing function" is necessarily nonlinear in multilayer neural networks. A single layer of a neural network can quite simply be written as a weight matrix times an input vector, equalling an output vector, like so:
Ax = y
Adding another layer is just multiplying a different set of weights times the output of the first, so
B(Ax)= y
If you remember your linear algebra course, you might see the problem: that can be simplified
(BA)x = y
Cx = y
Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.
To prevent this collapse, a nonlinear function must be introduced between each layer.
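The collapse is easy to check numerically (random matrices, numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 128))   # first "layer"
    B = rng.standard_normal((32, 64))    # second "layer"
    x = rng.standard_normal(128)

    print(np.allclose(B @ (A @ x), (B @ A) @ x))        # True: C = BA does the same job

    relu = lambda v: np.maximum(0.0, v)
    print(np.allclose(B @ relu(A @ x), (B @ A) @ x))    # False: the nonlinearity blocks the collapse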
Right. All the squashing is doing is keeping the output of any neuron in a range of below 1.
But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.
29 replies →
> It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
In Ye Olden days (the 90's) we used to approximate non-linear models using splines or separate-slopes models, fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.
Neural networks “just” do this automatically, and faster.
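For the curious, that hand-fit trick is still just linear least squares over a hinge basis (sketch with made-up data and hand-picked knots):

    import numpy as np

    x = np.linspace(0, 10, 300)
    y = np.sin(x) + 0.1 * x ** 2          # some nonlinear target

    knots = np.arange(1, 10)              # hand-picked knot locations
    basis = np.column_stack([np.ones_like(x), x] +
                            [np.maximum(0.0, x - k) for k in knots])   # "separate slopes" hinges
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    print("MSE with 9 knots:", np.mean((basis @ coef - y) ** 2))

Still a linear model in the coefficients; choosing where the hinges go is roughly the part a ReLU network learns for you.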
In college (BSME) I wrote a computer program to generate cam profiles from Bezier curves. It's just a programming trick to generate curves from straight lines at any level of accuracy you want just by letting the computer take smaller and smaller steps.
It's an interesting concept to think of how NNs might be able to exploit this effect in some way based on straight lines in the weights, because a very small number of points can identify a very precise and smooth curve, where directions on the curve might equate to Semantic Space Vectors.
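For reference, the whole trick fits in a few lines (de Casteljau's algorithm, numpy):

    import numpy as np

    def bezier(control_points, steps=100):
        # De Casteljau: repeated straight-line interpolation between control
        # points traces out a smooth curve.
        pts = []
        for t in np.linspace(0.0, 1.0, steps):
            level = np.asarray(control_points, dtype=float)
            while len(level) > 1:
                level = (1 - t) * level[:-1] + t * level[1:]   # lerp adjacent pairs
            pts.append(level[0])
        return np.array(pts)

    curve = bezier([[0, 0], [1, 3], [3, 3], [4, 0]])   # 4 control points -> smooth arc
    print(curve.shape)                                  # (100, 2)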
1 reply →