Comment by r0ze-at-hn
4 hours ago
Linking to the paper: https://arxiv.org/pdf/2605.01172 which is also a fantastic read; the application to deep learning is good. The paper does a lot of cross-mapping, and a lot of the machinery is old stuff under new names. Worth calling out the correspondences for those with those backgrounds:
"Cumulative Dissipation Gramian" Ws = Observability Gramian (from Control Theory). For example the spectral cutoff is exactly the Hankel singular value truncation from model reduction.
"Signal Channel" / "Reservoir" is Controllable/Observable vs. Uncontrollable/Unobservable Subspaces. Using Adamjan-Arov-Krein (AAK) theory gives the optimal nonlinear reduced model answering the optimal compression question.
"Drift–Diffusion Separation" is Freidlin-Wentzell Large Deviation Theory. They can predict "grokking" time from the FW action.
"Population-Risk Gate" is Quantum Weak Value / Postselection (Aharonov)
So for the follow-up problems:
Control theory gives the truncation error bounds for model compression.
Large deviation theory gives the grokking-time predictions (a toy escape-time demo follows).
Quantum measurement theory gives the imaginary preconditioners.
Information geometry gives the optimal continuous relaxation of the gate.
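On the grokking-time point, a hedged toy demo (the double-well potential, noise levels, and stopping rule are illustrative, not from the paper): for dX = -U'(X) dt + sqrt(2*eps) dW, Freidlin-Wentzell says the mean escape time tau from a metastable well satisfies eps * log(tau) -> DeltaU (the barrier height, i.e. the FW action of the escape path) as eps -> 0. Simulating a few noise levels shows eps * log(tau) drifting down toward the barrier:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_U(x):
        # U(x) = (x^2 - 1)^2 / 4: minima at +/-1, saddle at 0, barrier 0.25
        return x * (x * x - 1.0)

    def mean_escape_time(eps, dt=1e-3, n_paths=300):
        # Mean first time that paths started at x = -1 reach the saddle x = 0.
        x = np.full(n_paths, -1.0)
        t_hit = np.zeros(n_paths)
        alive = np.ones(n_paths, dtype=bool)
        step = 0
        while alive.any():
            step += 1
            kick = np.sqrt(2.0 * eps * dt) * rng.standard_normal(alive.sum())
            x[alive] += -grad_U(x[alive]) * dt + kick
            t_hit[alive & (x >= 0.0)] = step * dt
            alive &= (x < 0.0)
        return t_hit.mean()

    barrier = 0.25  # U(0) - U(-1)
    for eps in (0.16, 0.12, 0.09):  # smallest eps takes a few seconds
        tau = mean_escape_time(eps)
        print(f"eps={eps:.2f}  mean escape time ~ {tau:7.1f}"
              f"  eps*log(tau) = {eps * np.log(tau):.3f}  (barrier = {barrier})")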
Some implications for practice that are nice to see formalized here:
Old: Pick an architecture and hope it generalizes. New: Design the architecture to maximize observability Gramian rank. (Honestly we pull a lot from control theory here.)
Old: Use a validation set to detect overfitting. New: Monitor the λ(Ws) spectrum during training; no validation set needed.
Old: Prune post-hoc based on weight magnitude. New: Prune during training based on ker(Ws) membership. (A rough sketch of the monitor-and-prune idea follows this list.)
Old: Fixed learning rate. New: Spectral learning rate.
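Since the paper's exact construction of Ws isn't reproduced in this comment, here is a heavily hedged stand-in for the monitor-and-prune rows: accumulate an empirical Gramian of training gradients, G = sum_t g_t g_t^T, watch its eigenvalue spectrum as training runs, and flag coordinates in the numerical kernel of G, which never received gradient signal. The model, data, and thresholds below are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Tiny least-squares model y = X @ w. The last feature is identically
    # zero, so its weight can never receive signal: it is "in the reservoir".
    n, d = 256, 6
    X = rng.standard_normal((n, d))
    X[:, -1] = 0.0
    w_true = rng.standard_normal(d)
    w_true[-1] = 0.0
    y = X @ w_true

    w = np.zeros(d)
    G = np.zeros((d, d))  # cumulative gradient Gramian: the stand-in "Ws"
    lr = 0.01

    for step in range(501):
        g = X.T @ (X @ w - y) / n          # full-batch least-squares gradient
        G += np.outer(g, g)
        w -= lr * g
        if step % 100 == 0:
            lam = np.linalg.eigvalsh(G)    # monitor the spectrum in-flight
            print(f"step {step:3d}  eigenvalues of G: {np.round(lam, 5)}")

    # Prune coordinates in the numerical kernel of G: no accumulated signal,
    # so (on this reading) they sit in ker(Ws) and are safe to drop.
    prunable = np.diag(G) < 1e-12
    print("prunable coordinates:", np.where(prunable)[0])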