Comment by islewis
1 year ago
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512)
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
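To make the parallelizability point concrete: if the recurrence can be written as a linear map of the hidden state, h_t = a_t * h_{t-1} + b_t (which, roughly, is the trick behind such "fully parallelizable" recurrent architectures), the whole sequence can be computed with an associative scan in O(log T) parallel depth instead of T sequential steps. A minimal NumPy sketch, with scalar states and illustrative function names (not code from the paper):

```python
import numpy as np

def sequential_scan(a, b, h0=0.0):
    """h_t = a_t * h_{t-1} + b_t, one step at a time: O(T) sequential depth."""
    h, out = h0, []
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out.append(h)
    return np.array(out)

def parallel_scan(a, b, h0=0.0):
    """Same recurrence via a Hillis-Steele scan: O(log T) depth of elementwise ops.

    Each position carries an affine map (A_t, B_t) with h_t = A_t * h0 + B_t;
    composing two such maps is associative, so prefixes combine in parallel.
    """
    A = np.asarray(a, dtype=float).copy()
    B = np.asarray(b, dtype=float).copy()
    T, offset = len(A), 1
    while offset < T:
        A_prev = np.concatenate([np.ones(offset), A[:-offset]])   # identity map for t < offset
        B_prev = np.concatenate([np.zeros(offset), B[:-offset]])
        A, B = A * A_prev, A * B_prev + B                          # compose with earlier prefix
        offset *= 2
    return A * h0 + B

a, b = np.random.rand(512), np.random.randn(512)
assert np.allclose(sequential_scan(a, b, 1.0), parallel_scan(a, b, 1.0))
```

On a GPU every line inside the loop is a bulk elementwise op over the whole sequence, which is where speedups of the kind quoted above come from.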
> Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
This! Not just the fastest, but the one with the lowest total resource use.
Fully connected neural networks are universal function approximators. Technically we don’t need anything but an FNN, but memory requirements and speed would be abysmal, far beyond the realm of practicality.
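As a toy illustration of both halves of that claim (a sketch of my own, not from the thread or any paper): a single hidden ReLU layer can interpolate any 1D function at chosen knots, but the width, and hence the memory, grows with the accuracy you ask for.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def one_hidden_layer_interpolant(f, knots):
    """Build a width-(len(knots)-1) ReLU network that matches f at every knot.

    The network is f(x0) + sum_i c_i * relu(x - x_i); choosing c_i as the change
    in segment slope makes it exactly the piecewise-linear interpolant of f.
    """
    y = f(knots)
    slopes = np.diff(y) / np.diff(knots)     # slope of each linear segment
    c = np.diff(slopes, prepend=0.0)         # c_i = s_i - s_{i-1}
    def predict(x):
        x = np.atleast_1d(x)
        hidden = relu(x[:, None] - knots[:-1][None, :])   # one ReLU unit per knot
        return y[0] + hidden @ c
    return predict

# More knots -> smaller error, but linearly more units (and weights) to store.
knots = np.linspace(0.0, 2.0 * np.pi, 33)
net = one_hidden_layer_interpolant(np.sin, knots)
xs = np.linspace(0.0, 2.0 * np.pi, 1000)
print(np.abs(net(xs) - np.sin(xs)).max())    # ~5e-3; halving the knot spacing quarters it
```

That width-for-accuracy trade is the "abysmal memory" part: the universal approximation theorem guarantees existence, not efficiency.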
Unless we could build chips in 3D?
Not even then; a truly fully connected network would have super-exponential runtime (it would take N^N time to evaluate).
We are already doing this.
Heat extraction.
> finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale
Not to him; he runs the ARC challenge. He wants a new approach entirely: something capable of few-shot learning of out-of-distribution patterns... somehow.