Comment by porcoda
16 hours ago
As others pointed out, the explosion of interest started with the deep convolutional networks applied to image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as uninteresting. When I took a course on them around 2000, that was the prevailing attitude. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple of years earlier: I began to see workshop talks about NNs that weren't being dismissed anymore, probably starting around 2008/09.
I played with NNs in the late '80s/early '90s, with little more than a copy of Hinton's paper, a PC, and a C compiler. Obviously, I got no practical results. But I got an intuition for how they worked and what they could potentially do.
Cut to 2008-09, and I started to see smartphones, grid (then cloud) computing, and social networks emerging. My MBA dissertation, finished in 2011, was about how that would change the world, because the requirements for meaningful AI were coming along: data and compute. The theory was already there (Hinton, LeCun, Schmidhuber, etc.).
That got me back into the Data Science field, after years working in Data Engineering. Too bad I lived in Brazil back then and couldn't find a way to join the emerging scene in California and other top places. I'd be rich now...
> NNs were largely dismissed
I agree with your larger point, but "dismissed" is rather too strong. They were considered fiddly to train: prone to local minima, long training times, and no clear guidelines for how many hidden layers and nodes to use. But for homework (toy) exercises they were still ok.
In comparison, kernel methods gave a better experience overall for large (but not super large) data sets. Most models had an easily obtainable global minimum. Fewer moving parts and very good performance.
It turns out, however, that if you have several orders of magnitude more data, the usual kernels are too simple: (i) they cannot take advantage of more data past a point and just start twiddling the 10th decimal place of some parameters, and (ii) they are expensive to train on very large data sets. So a bit of a double whammy. Well, there was a third: no hardware acceleration that could compare with GPUs.
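To make the cost point concrete, here's a minimal kernel ridge regression sketch (toy data and made-up hyperparameters, not anyone's actual setup): the objective is convex, so one linear solve gives the global minimum, but it needs the full n-by-n Gram matrix, which is exactly what stops scaling once n gets huge.

```python
import numpy as np

def rbf_kernel(Xa, Xb, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian/RBF kernel.
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    # Convex objective => a single linear solve reaches the global minimum.
    # The catch: K is n x n, so this is O(n^2) memory and O(n^3) time.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, X_new, alpha, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))                     # fine at n = 1e3...
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1_000)  # ...hopeless at n = 1e7
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, X[:3], alpha))
```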
Kernels may make a comeback though, you never know. We need to find a way to compose kernels in a user-friendly way to increase their modeling capacity. We had a few ways of doing just that, but they weren't great. We need a breakthrough to scale them to GPT-sized data sets.
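For what it's worth, the composition people did try mostly amounted to sums and products of a fixed catalogue of base kernels (both operations preserve positive definiteness). Roughly something like this, with arbitrary illustrative weights and hyperparameters:

```python
import numpy as np

def rbf(Xa, Xb, gamma=1.0):
    sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def linear(Xa, Xb):
    return Xa @ Xb.T

def poly(Xa, Xb, degree=2, c=1.0):
    return (Xa @ Xb.T + c) ** degree

# Sums and products of valid kernels are still valid kernels, so you can
# hand-assemble a richer hypothesis space out of the base catalogue...
def composed(Xa, Xb):
    return (0.5 * rbf(Xa, Xb, gamma=0.1)
            + 0.3 * linear(Xa, Xb) * rbf(Xa, Xb, gamma=10.0)
            + 0.2 * poly(Xa, Xb))

# ...but the structure and weights are still picked by hand (or by an outer
# search), not learned end to end, and the n x n Gram matrix cost remains.
```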
In a way, DNNs are "design your own kernels using data," whereas kernels came in any color you liked provided it was black. (Yes, there were many types, but it was still a fairly limited catalogue. The killer was that there was no good way of composing them to increase modeling capacity while still yielding efficiently trainable kernel machines.)
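One way to make the "design your own kernel" phrase concrete: the last hidden layer of a trained network is a feature map phi learned from the data, and the inner product of those features is, in effect, the kernel the network built for itself. A toy sketch (architecture, data, and hyperparameters are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem: y = sin(3x) + noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# One hidden layer; its activations play the role of a learned feature map phi(x).
d_hidden = 32
W1 = rng.normal(0.0, 1.0, (1, d_hidden)); b1 = np.zeros(d_hidden)
w2 = rng.normal(0.0, 0.1, d_hidden);      b2 = 0.0

def phi(X):
    return np.tanh(X @ W1 + b1)  # the "designed" feature map

# Plain full-batch gradient descent on 0.5 * MSE.
lr = 0.05
for _ in range(2000):
    H = phi(X)                        # (n, d_hidden)
    err = (H @ w2 + b2) - y           # residuals
    dH = np.outer(err, w2) * (1 - H ** 2)
    w2 -= lr * (H.T @ err) / len(X); b2 -= lr * err.mean()
    W1 -= lr * (X.T @ dH) / len(X);  b1 -= lr * dH.mean(axis=0)

# The kernel the network "designed" from the data: inner products of learned features.
def learned_kernel(Xa, Xb):
    return phi(Xa) @ phi(Xb).T

print(learned_kernel(X[:3], X[:3]))
```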
DeepMind solving Atari games was another big milestone around that time.