Comment by seertaak
1 month ago
This -- and obviously David Ng's article -- are absolutely fascinating pieces of work.
I have a few (very naive) questions:
There is a widespread intuition, encapsulated in the very terms "feed-forward networks" and "deep neural networks", that computation in such networks is akin to a circuit wired in series. My "observation" is that residual layers offer an "escape hatch" from this, allowing layers (or sets of layers) to operate in parallel (and, of course, anything in between).
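(To pin down what I mean, here's a minimal toy sketch of a residual block -- my own code, not anyone's actual architecture. The identity path bypasses the learned transform F and the two are summed, so F runs "alongside" the stream rather than strictly in series.)

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy residual block: output = x + F(x)."""
    def __init__(self, dim: int):
        super().__init__()
        # F is the learned branch; the identity path bypasses it entirely.
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The "+ x" is the escape hatch: the block can leave the stream
        # untouched instead of forcing every layer into a serial pipeline.
        return x + self.f(x)
```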
So here are my dumb questions:
1. Is my intuition correct that residual networks, at least in principle, allow layers to operate in parallel? Or am I missing something fundamental? Let's say the intuition is correct -- is it possible to measure the degree to which a layer operates in series versus in parallel? (I've sketched one crude candidate measurement after the list.)
2. The formula for residual layers (at least to my mind) reminds me of an Ornstein-Uhlenbeck time-series process. If so, can we measure the degree of mean reversion of a layer (or of several layers)? For me this makes intuitive sense -- the goal of avoiding vanishing gradients feels similar to the goal of stationarity in time-series processes. (I've spelled out the comparison I mean after the list.)
3. Let's take as an article of faith the central idea of a tripartite network: input->latent-space block => reasoning block => latent-space->output block. Ng's intuition, iiuc, is that the reasoning block is, more or less, wired in series. Intuitively, that feels like what it ought to be (i.e., a chain of calculations), though I'll add -- again hand-wavingly -- that OP's efforts appear to cast doubt on this conjecture. Are the two "translation" blocks wired "more" in parallel, then?
4. So what both Ng and OP did was to "tape together" the ostensibly reasoning layers -- in different ways, but that's essentially it. Another thing you could do is treat the input and output translation blocks as fixed. You then train a totally new model on a much smaller corpus of training data, only instead of feeding the input directly to your new model, you feed it the translated training data (similarly, your targets are now the activations at the entrance to the reasoning->output block). Let's assume the middle is exactly the same architecture as in the standard network, only initialized to random weights as per usual. Surely you should be able to pre-train that 6-layer reasoning network much, much faster. Has anyone tried this? (I've sketched the training loop I'm imagining after the list.)
5. Having thus partitioned a very deep architecture into three distinct parts, you could also experiment with making the reasoning block wider or narrower. Has anyone tried that?
6. Another fun idea is to map a given input through the input block and read off the pre-reasoning activations. You then let that vector be a random variable, do a random walk through reasoning-input space, and use the result to "augment" your corpus of training data. Reasonable idea or bullshit? (Sketch of what I mean after the list.)
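On question 1, here's the crude measurement I can imagine -- my own guess at a diagnostic, not an established metric; it reuses the toy ResidualBlock above. Compare the size of the branch's update F(x) to the size of the incoming stream x: a ratio near zero would suggest the layer mostly passes the stream through ("parallel" / near no-op), while a large ratio would suggest heavy serial rewriting.

```python
import torch

@torch.no_grad()
def branch_ratio(block: ResidualBlock, x: torch.Tensor) -> float:
    """Mean ||F(x)|| / ||x||: a rough proxy for how much a residual
    layer rewrites its input versus passing it through."""
    fx = block.f(x)
    return (fx.norm(dim=-1) / x.norm(dim=-1)).mean().item()

block = ResidualBlock(dim=64)
x = torch.randn(8, 64)           # a batch of 8 made-up activations
print(branch_ratio(block, x))    # near 0 => mostly pass-through
```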
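On question 2, the comparison I have in mind, spelled out (my own reading of the analogy, so take it with salt):

```latex
% Residual update vs. an Euler-discretized Ornstein-Uhlenbeck step:
x_{l+1} = x_l + F(x_l)
% versus
x_{t+\Delta t} = x_t + \theta(\mu - x_t)\,\Delta t
               + \sigma\sqrt{\Delta t}\,\varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, I)
% The two coincide when F(x) \approx \theta(\mu - x)\,\Delta t,
% i.e. when the learned update acts as a pull toward some mean \mu.
```

If that reading is right, an effective theta (the degree of mean reversion) could be estimated per layer by regressing F(x_l) against -x_l over many inputs -- but that's me speculating, not something I've seen done.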
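On question 4, here's roughly the loop I'm imagining -- purely a sketch under my own assumptions: input_block and output_block stand in for the frozen translation halves of a pretrained model (real ones would be stacks of transformer layers, not single Linears), target_latent is the pretrained model's activation at the entrance to the reasoning->output block, and the middle reuses the toy ResidualBlock above.

```python
import torch
import torch.nn as nn

dim, n_layers = 512, 6
input_block = nn.Linear(dim, dim)   # stand-in: pretrained input->latent half
output_block = nn.Linear(dim, dim)  # stand-in: latent->output half (not needed here)
reasoning = nn.Sequential(*[ResidualBlock(dim) for _ in range(n_layers)])

# Freeze the translation blocks; only the fresh middle gets trained.
for block in (input_block, output_block):
    for p in block.parameters():
        p.requires_grad = False

opt = torch.optim.AdamW(reasoning.parameters(), lr=3e-4)

def train_step(x: torch.Tensor, target_latent: torch.Tensor) -> float:
    with torch.no_grad():
        z = input_block(x)          # the "translated" training input
    loss = nn.functional.mse_loss(reasoning(z), target_latent)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```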
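And on question 6, the augmentation would look something like this -- again my own sketch, and whether such a walk stays on the manifold of realistic activations is presumably exactly the question:

```python
import torch

@torch.no_grad()
def latent_random_walk(z0: torch.Tensor, steps: int = 10, sigma: float = 0.05):
    """Treat the pre-reasoning activation z0 as a random variable and
    take a Gaussian random walk around it, yielding synthetic latents."""
    z = z0.clone()
    for _ in range(steps):
        z = z + sigma * torch.randn_like(z)
        yield z.clone()

# Usage: z0 = input_block(x); synthetic_latents = list(latent_random_walk(z0))
```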
Please remember, I'm only just (and belatedly) trying to wrap my head around how transformer architectures work -- I'm still waiting for my copy of "Build a Large Language Model (from scratch)"! I hope these questions aren't totally daft!