Comment by kqr
3 days ago
What confuses me about deep nets is that there's rarely enough signal to be able to meaningfully train a large number of parameters. Surely 99 % of those parameters are either (a) incredibly unstable, or (b) correlate perfectly with other parameters?
They do. There are enormous redundancies. There's a manifold over which the parameters can vary wildly yet do zilch to the output. The nonlinear analogue of a null space.
Parameter instability does not worry a machine learner as much as it worries a statistician. ML folks worry about output instabilities.
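That "nonlinear analogue of a null space" is easy to see concretely. A minimal sketch (toy network, sizes and values arbitrary): in a two-layer ReLU net, scaling the first weight matrix up and the second one down by the same factor moves the parameters a lot while leaving the output untouched.

```python
import numpy as np

# Toy two-layer ReLU network: y = W2 @ relu(W1 @ x).
# For any c > 0, relu(c * W1 @ x) = c * relu(W1 @ x), so scaling
# W1 by c and W2 by 1/c is a pure redundancy direction in
# parameter space: the parameters vary wildly, the output doesn't.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
x = rng.normal(size=4)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

y = forward(W1, W2, x)
c = 3.7  # arbitrary positive rescaling
y_rescaled = forward(c * W1, W2 / c, x)

print(np.allclose(y, y_rescaled))
```

A statistician would call the individual entries of W1 here hopelessly unidentified; the output map is what is stable.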
The current understanding is that this overparameterization makes good configurations easier to reach while keeping the search algorithm as simple as stochastic gradient descent.
Huh, I didn't know that! Are there efforts to automatically reduce the number of parameters once the model is trained? Or do the relationships between parameters end up too complicated to do that? I would assume such a reduction would be useful for explainability.
(Asking specifically about time series models and such.)
What you are looking for is the lottery ticket hypothesis for neural networks. Hit a search engine with those words and you will find examples.
https://arxiv.org/abs/1803.03635 (you can follow up on Semantic Scholar for more)
Selecting which weights to discard seems as hard as the original problem, but random decimation, and sometimes barely informed decimation, has been observed to be effective.
On the theory side, it's now understood that within the thicket of weights lurks a much, much smaller subnetwork that can produce nearly the same output.
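The mechanics of the lottery-ticket procedure are simple to sketch (the sizes and the stand-in "trained" weights below are made up for illustration, not taken from the paper): keep only the largest-magnitude weights of a trained matrix, then rewind the survivors to their values at initialization.

```python
import numpy as np

# Hypothetical example of the lottery-ticket pruning step.
rng = np.random.default_rng(1)
W_init = rng.normal(size=(100, 100))              # weights at initialization
W_trained = W_init + rng.normal(size=(100, 100))  # stand-in for trained weights

sparsity = 0.9  # discard 90% of the weights
threshold = np.quantile(np.abs(W_trained), sparsity)
mask = np.abs(W_trained) > threshold  # support of the "winning ticket"

# Rewind the surviving weights to their initial values; the claim is
# that retraining this sparse subnetwork recovers nearly full accuracy.
W_ticket = np.where(mask, W_init, 0.0)
print(mask.mean())  # roughly 0.1 of the weights survive
```

In the actual procedure this prune-and-rewind step is iterated, with retraining in between.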
These observations are for DNNs in general. For time series specifically I don't know what the state of the art is; in general, NNs are still catching up with traditional statistical approaches in this domain. There are a few examples where traditional approaches have been beaten, but only a few.
One good source to watch is the M series of forecasting competitions.
It's fairly standard to prune the hell out of a model for deployment, because many of the parameters end up being close to zero. This doesn't really help with explainability of the parameters, because (imo) that's a dead end. You assume that the data is iid and a representative sample of whatever god-given function generated it, and you throw a universal approximator at it because it's impossible to come up with some a priori function that models the data in the first place.
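The "close to zero" observation is why magnitude pruning works at all. A minimal sketch with made-up numbers: if most weights are tiny, zeroing them barely moves the output.

```python
import numpy as np

# Synthetic weight vector: ~90% of entries near zero, ~10% that matter.
rng = np.random.default_rng(2)
w = np.where(rng.random(1000) < 0.9,
             rng.normal(scale=1e-3, size=1000),  # near-zero weights
             rng.normal(scale=1.0, size=1000))   # the weights that matter
x = rng.normal(size=1000)

# Magnitude pruning: drop everything below a small threshold.
w_pruned = np.where(np.abs(w) < 0.01, 0.0, w)

print(np.dot(w, x), np.dot(w_pruned, x))  # nearly identical outputs
```

Real deployments do this per layer with retraining or fine-tuning afterwards, but the principle is the same.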
Latent space clustering is about as good as it gets imo, and in my experience, that's fairly stable for individual implementations (but not necessarily across implementations for the same model, for various reasons), but it doesn't tell you anything about the meaning of the parameters themselves. If the model is well calibrated, you can validate its performance and it becomes explainable as a unit.
If you started with a deep neural network, you can't really use pruning to go all the way down to a parameter count that is directly interpretable (say, under 100). You would at least have to try some techniques to get more disentangled representations. But local surrogate models are popular for explainability; see SHAP and LIME. For interpretable time series I would encourage constructing features and transformations the old-fashioned way, and then learning it all end to end as a differentiable program. Then you get the best of both worlds.
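A hedged sketch of that last suggestion, with made-up data and feature choices: hand-crafted time-series features (lags, a rolling mean) feed a small linear readout, and only the readout weights are learned by gradient descent. The features stay human-readable, so each fitted weight attaches to a named quantity.

```python
import numpy as np

# Synthetic AR(1) series as illustrative data.
rng = np.random.default_rng(3)
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.1)

def features(y, t):
    """Hand-crafted, interpretable features for predicting y[t]."""
    return np.array([y[t - 1],           # lag 1
                     y[t - 2],           # lag 2
                     y[t - 5:t].mean(),  # 5-step rolling mean
                     1.0])               # bias term

X = np.stack([features(y, t) for t in range(5, n)])
target = y[5:]

# End-to-end fit of the readout by plain gradient descent on MSE.
w = np.zeros(4)
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - target) / len(target)
    w -= 0.1 * grad

print(w)  # each weight is directly attached to a named feature
```

In a real differentiable program the feature transformations themselves could also carry learnable parameters (e.g. a window length relaxed to a soft weighting), while staying inspectable in a way a pruned deep net never is.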