Comment by islewis
1 year ago
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512)
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
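To make the parallelizability point concrete: if the recurrence can be written as a linear map of the hidden state, h_t = a_t * h_{t-1} + b_t (which, roughly, is the trick behind such "fully parallelizable" recurrent architectures), the whole sequence can be computed with an associative scan in O(log T) parallel depth instead of T sequential steps. A minimal NumPy sketch, with scalar states and illustrative function names (not code from the paper):

```python
import numpy as np

def sequential_scan(a, b, h0=0.0):
    """h_t = a_t * h_{t-1} + b_t, one step at a time: O(T) sequential depth."""
    h, out = h0, []
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out.append(h)
    return np.array(out)

def parallel_scan(a, b, h0=0.0):
    """Same recurrence via a Hillis-Steele scan: O(log T) depth of elementwise ops.

    Each position carries an affine map (A_t, B_t) with h_t = A_t * h0 + B_t;
    composing two such maps is associative, so prefixes combine in parallel.
    """
    A = np.asarray(a, dtype=float).copy()
    B = np.asarray(b, dtype=float).copy()
    T, offset = len(A), 1
    while offset < T:
        A_prev = np.concatenate([np.ones(offset), A[:-offset]])   # identity map for t < offset
        B_prev = np.concatenate([np.zeros(offset), B[:-offset]])
        A, B = A * A_prev, A * B_prev + B                          # compose with earlier prefix
        offset *= 2
    return A * h0 + B

a, b = np.random.rand(512), np.random.randn(512)
assert np.allclose(sequential_scan(a, b, 1.0), parallel_scan(a, b, 1.0))
```

On a GPU every line inside the loop is a bulk elementwise op over the whole sequence, which is where speedups of the kind quoted above come from.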
> Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
This! Not just the fastest, but the one with the lowest total resource use.
Fully connected neural networks are universal function approximators. Technically we don’t need anything but an FNN, but memory requirements and speed would be abysmal, far beyond the realm of practicality.
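As a toy illustration of both halves of that claim (a sketch of my own, not from the thread or any paper): a single hidden ReLU layer can interpolate any 1D function at chosen knots, but the width, and hence the memory, grows with the accuracy you ask for.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def one_hidden_layer_interpolant(f, knots):
    """Build a width-(len(knots)-1) ReLU network that matches f at every knot.

    The network is f(x0) + sum_i c_i * relu(x - x_i); choosing c_i as the change
    in segment slope makes it exactly the piecewise-linear interpolant of f.
    """
    y = f(knots)
    slopes = np.diff(y) / np.diff(knots)     # slope of each linear segment
    c = np.diff(slopes, prepend=0.0)         # c_i = s_i - s_{i-1}
    def predict(x):
        x = np.atleast_1d(x)
        hidden = relu(x[:, None] - knots[:-1][None, :])   # one ReLU unit per knot
        return y[0] + hidden @ c
    return predict

# More knots -> smaller error, but linearly more units (and weights) to store.
knots = np.linspace(0.0, 2.0 * np.pi, 33)
net = one_hidden_layer_interpolant(np.sin, knots)
xs = np.linspace(0.0, 2.0 * np.pi, 1000)
print(np.abs(net(xs) - np.sin(xs)).max())    # ~5e-3; halving the knot spacing quarters it
```

That width-for-accuracy trade is the "abysmal memory" part: the universal approximation theorem guarantees existence, not efficiency.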
Unless we could build chips in 3D?
Not even then; a truly fully connected network would have super-exponential runtime (it would take N^N time to evaluate).
We are already doing this.
Heat extraction.
> finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale
Not to him; he runs the ARC challenge. He wants a new approach entirely: something capable of few-shot learning of out-of-distribution patterns... somehow.