Comment by bhadass
20 days ago
but isn't this like a lot of other CS-related "gradient descent"?
when someone invents a new scheduling algorithm or a new concurrent data structure, it's usually based on hunches and empirical results (benchmarks) too. nobody sits down and mathematically proves their new linux scheduler is optimal before shipping it. they test it against representative workloads and see if there is uplift.
we understand transformer architectures at the same theoretical level we understand most complex systems. we know the principles, we have solid intuitions about why certain things work, but the emergent behavior of any sufficiently complex system isn't fully predictable from first principles.
that's true of operating systems, distributed databases, and most software above a certain complexity threshold.
No. Algorithm analysis is much more sophisticated and well defined than that. Most algorithms are deterministic, and it is relatively straightforward to identify complexity, O(). Even nondeterministic algorithms we can evaluate asymptotic performance under different categories of input. We know a lot about how an algorithm will perform under a wide variety of input distributions regardless of determinism. In the case of schedulers, and other critical concurrency algorithms, performance is well known before release. There is a whole subfield of computer science dedicated to it. You don't have to "prove optimality" to know a lot about how an algorithm will perform. What's missing in neural networks is the why and how any inputs will propagate, through the network during inference. It is a black box of understandability. Under a great deal of study, but still very poorly understood.
i agree w/ the the complexity analysis point, but that theoretical understanding actually translates to real world deployment decisions in both subfields. knowing an algorithm is O() tells you surprisingly little about whether itll actually outperform alternatives on real hardware with real cache hierarchies, branch predictors, and memory access patterns. same thing with ML (just with the very different nature of GPU hw), both subfields hve massive graveyards of "improvements" that looked great on paper (or in controlled environments) but never made it into production systems. arxiv is full of architecture tweaks showing SOTA on some benchmark and the same w/ novels data structures/algorithms that nobody ever uses at scale.
I think you missed the point. Proving something is optimal, is a much higher bar than just knowing how the hell the algorithm gets from inputs to outputs in a reasonable way. Even concurrent systems and algorithm bounds under input distributions have well established ways to evaluate them. There is literally no theoretical framework for how a neural network churns out answers from inputs, other than the most fundamental "matrix algebra". Big O, Theta, Omega, and asymptotic performance are all sound theoretical methods to evaluate algorithms. We don't have anything even that good for neural networks.