Why Momentum Works

9 years ago (distill.pub)

I'm curious about the method chosen to give short term memory to the gradient. The most common way I've seen when people have a time sequence of values X[i] and they want to make a short term memory version Y[i] is to do something of this form:

  Y[i+1] = B * Y[i] + (1-B) * X[i+1]

where 0 <= B <= 1.

Note that if the sequence X becomes a constant after some point, the sequence Y will converge to that constant (as long as B != 1).

For giving the gradient short term memory, the article's approach is of the form:

  Y[i+1] = B * Y[i] + X[i+1]

  Note that if X becomes constant, Y converges to X/(1-B), as long as B is in [0,1).

Short term memory doesn't really seem to describe what this is doing. There is a memory effect in there, but there is also a multiplier effect when in regions where the input is not changing. So I'm curious how much of the improvement is from the memory effect, and how much from the multiplier effect? Does the more usual approach (the B and 1-B weighting as opposed to a B and 1 weighting) also help with gradient descent?
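
To make the multiplier effect concrete, here's a tiny toy script (my own sketch, not from the article) that feeds a constant input into both recurrences:

    # Compare the (B, 1-B) exponential average with the (B, 1) accumulation
    # used for momentum, on a constant input stream X[i] = 1.
    B = 0.9
    x = 1.0
    y_ema, y_mom = 0.0, 0.0
    for _ in range(200):
        y_ema = B * y_ema + (1 - B) * x   # converges to x
        y_mom = B * y_mom + x             # converges to x / (1 - B)
    print(y_ema)   # ~1.0
    print(y_mom)   # ~10.0

The (B, 1) form bakes in a 1/(1-B) gain on top of the smoothing, which is the multiplier effect I'm asking about.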

  • I assume that multiplying by a given factor shouldn't matter since you still have the learning rate as a factor (which is itself a factor of the gradient). This might just mean that the learning rate should be lower or higher with this method.

  • Very good question! I have considered this issue too. This form of weighting is the kind used in ADAM, and is qualitatively different from the updates described here. The tools of analysis in this article can be used to understand that iteration too (this amounts to a different R matrix), and I would be curious to see if it too allows for a quadratic speedup.

    [EDIT] As per halfling's comment, this is just a change of the learning rate by a factor of (1-beta); a quick check is below.
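
    Quick check of that claim (my own algebra, so take it with a grain of salt): if Y uses the (B, 1-B) weighting and Z uses the (B, 1) weighting on the same inputs, and both start at 0, then Y[i] = (1-B) * Z[i] for all i, since

        Y[i+1] = B * Y[i] + (1-B) * X[i+1]
               = B * (1-B) * Z[i] + (1-B) * X[i+1]
               = (1-B) * (B * Z[i] + X[i+1])
               = (1-B) * Z[i+1]

    so taking steps of size alpha along Y is the same as taking steps of size alpha*(1-B) along Z.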

It would be nice if the introduction made clear that the "Momentum" that "works" is some algorithm and not at all the physical concept "momentum".

  • At least it wasn't some random software project. My expectation was about 50/50 between "something to do with actual momentum" and "yet another Javascript framework".

  • distill.pub is a blog specifically for machine learning. While the poster could have added something to the title, the usual practice around here is to leave it as is.

  • Good point. I also had no idea what this is talking about. An introductory sentence about it in the article would have been good.

Reminds me of this paper, "Causal Entropic Forces":

http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf

I can't follow the math but the presentation is gorgeous (Safari on a MacBook retina display). Really great, keep up the good work!

  • not following the math is understandable since it only takes a little unfamiliarity to make math seem hard to understand.

    but if you've studied linear systems a bit, the math is laid out really well for understanding and turning into practical applications (which seems to be the whole point of distill.pub).

    many math articles, in contrast, seem to obfuscate the "why" and the "what for" and concentrate on the theory and derivation (e.g., many wikipedia math articles). that might be great for mathematical elegance (or cynically, one-upmanship), but not so great for people trying to build things using that valuable knowledge.

  • Which part of the math did you find difficult to follow?

    • Better question: What background is the reader expected to have?

      Until Xi, not a single variable is defined prior to inclusion into an expression. Even Alpha and Beta are only defined in the header diagram rather than the body of the text. Also, why are the iteration notations in the superscript rather than subscript?

      And before someone chimes in and says that rudiments are necessary to understand this work, no they aren't. The logical steps here are exceptionally simple (and intuitive - as the introduction might lead you to believe) once you get past the delivery. This is a fantastic article that could become very accessible with the proper notation housekeeping.

I have to say that the dynamics of momentum diagram is a thing of real beauty. The whole paper felt a little NYTimes-like, and then of course I see that Shan Carter helped a little bit with it!

But simply not on Firefox

  • Thanks for pointing that out! We fixed the diagram bug in Firefox. There's still bad performance for ~30s after page load -- we're looking into why that's happening -- but after that the page seems to work well.

    • Generating the nice formulas from TeX syntax seems to eat most of those 30s for me - maybe you could do that at build time instead of pushing that work to the client?

    • Cool that you fixed it already! I should have started with some praise: great selection of articles

Curious, has this method been used for solving linear systems? How would it perform e.g. against conjugate gradient?

And how would it perform for non-positive-definite systems?

  • Author here - yes! It can be used to solve symmetric PSD systems, as the solution of Ax = b is also the minimizer of 0.5*x'Ax - b'x. Conjugate gradient can be seen as a special form of momentum with adaptive alphas and betas.
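
    A minimal sketch of that idea (my own toy code, not from the article; alpha and beta are just hand-picked for this particular A):

        import numpy as np

        # Solve Ax = b for symmetric positive-definite A by minimizing
        # f(x) = 0.5*x'Ax - b'x with gradient descent plus momentum.
        A = np.array([[3.0, 1.0], [1.0, 2.0]])
        b = np.array([1.0, 1.0])

        x = np.zeros(2)
        z = np.zeros(2)            # accumulated gradient ("velocity")
        alpha, beta = 0.3, 0.5     # hand-tuned for this A
        for _ in range(100):
            grad = A @ x - b       # gradient of the quadratic
            z = beta * z + grad
            x = x - alpha * z

        print(x)                   # close to np.linalg.solve(A, b)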

    • > Conjugate gradient can be seen as a special form of momentum

      Just to be clear, though, CG doesn't use the negative gradients as search directions, as steepest descent would.

Some of the multi-author articles on Distill have a very important (IMO) innovation. They quantify precisely the contribution each author has made to the article. I would like to see this become the norm in scientific papers, so on the one hand it'd be clear who to ask questions if they arise, and on the other the various dignitaries won't get their honorary spot on the author lists of papers they were barely involved with scientifically.

  • As others mentioned, this is common in other fields, it's just not done in machine learning.

    We've been reading through the policies of lots of journals that seem thoughtfully run and borrowing good ideas. :P

  • Other journals, e.g. Nature, already had this. It's not a Distill innovation. But I agree, this should be the norm in journals/confs/etc.

For those that don't read materials about optimization as much as they maybe should, what is "w⋆"? It is used without introduction and I don't know what it is. Perhaps this is a convention I am not aware of?

I'm really loving the choice of articles, especially since you're just getting started.

Edit: I'm referring to the journal, not the author.

Pretty cool, it brings in stuff from classical control theory. I always kinda felt it was a missed opportunity that classical controls weren't really brought up in discussions of gradient descent.

People, listen. "Damping" means to reduce the amplitude of an oscillation. "Dampening" means to make something wetter.

</rant>

Hm. So that helps with high-frequency noise. Any progress on what to do when the dimensions are of vastly different scales? I have an old physics engine which had to solve a system of about 20 nonlinear differential equations. During a collision, the equations go stiff, and some dimensions may be 10 orders of magnitude steeper than others. Gradient descent then faces very steep knife edges. This is called a "stiff system" numerically.

  • Author here - I believe the problem of a "stiff system" you're referring to is exactly the problem of pathological curvature!

    Some points not touched on in the article: if the individual dimensions are of different scales, this problem can be easily fixed with a diagonal preconditioner (toy sketch below). Even something like ADAM or Adagrad (unconventional, I know, in this domain) can be used.

    There's also a small industry around more sophisticated preconditioners for the linear systems in PDEs, see Multigrid, for example, or preconditioned conjugate gradient.
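
    Toy illustration of the diagonal-preconditioner point (my own sketch, with made-up scales):

        import numpy as np

        # Gradient descent on f(x) = 0.5 * sum(s_i * x_i^2), where the
        # per-dimension curvatures s_i differ by many orders of magnitude.
        s = np.array([1e-5, 1.0, 1e5])
        grad = lambda x: s * x

        # Diagonal preconditioner: scale each gradient component by the
        # inverse of its curvature, so every dimension sees roughly unit scale.
        P_inv = 1.0 / s

        x = np.ones(3)
        alpha = 0.5
        for _ in range(50):
            x = x - alpha * P_inv * grad(x)   # preconditioned step

        print(x)   # all three coordinates shrink at the same rate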

    • The stiffness may be local. It definitely is in a physical simulation for hard collisions. Machine learning data is usually normalized into [0..1], so if you get a really steep slope, something is pathological.

  • I'm not an expert on anything covered in the article, but we have a similar physics-based model at my work (complex non-linear equations). We use a technique called Sequential Quadratic Programming (SQP) to find an optimal solution. My understanding is that this gives better results than using gradient descent but will only work if the functions are continuous.

    This could be worth looking into for you.
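
    If you're in Python land, SciPy's minimize exposes an SQP-flavored method (SLSQP) that's easy to experiment with; a toy constrained example (mine, obviously nothing like your physics model):

        import numpy as np
        from scipy.optimize import minimize

        # Minimize a smooth objective subject to a linear inequality constraint
        # using SciPy's SQP-style solver.
        objective = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2
        constraints = [{'type': 'ineq', 'fun': lambda x: x[0] - 2 * x[1] + 2}]

        result = minimize(objective, x0=np.array([2.0, 0.0]),
                          method='SLSQP', constraints=constraints)
        print(result.x, result.success)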

It would be really, really great if you could somehow hook this up to Discourse so people could comment on and ask questions about the article. Allowing people to ask questions and having others answer like MathOverflow would I think bring a lot more clarity. Many different kinds of people want to understand material like this but may need the math unpacked in different ways.

  • I don't see how Discourse would be better than something like HN or Reddit. There's also a submission to Reddit[1].

    With Discourse, I think there would be more noise and a lot of time would be wasted scrolling through unnecessary replies. What's good about the threaded nature of HN/Reddit is that you'll have proper context and the rating system does its job so everyone's time won't be wasted.

    Questions can be answered here and on Reddit too. I think Reddit can be sometimes more helpful than HN when it comes to answering 'easy' questions so I would go there if you have any simple questions.

    [1]: https://www.reddit.com/r/MachineLearning/comments/63f3uk/r_w...

    • I disagree. The most useful replies can be upvoted to deal with noise, and using mathematical formulas to help explain responses is absolutely crucial. I can't tell you how helpful it has been to be looking at an obscure proof, post a question to Math Overflow, and have the answer explained in an intuitive way with reference to the symbols and notation used.

      These articles on distill I believe could greatly benefit from this. Let the community help distill.

    • HN doesn't work if someone has a question in 2 weeks. Both HN and Reddit have an incredible skew towards current things (it's sort of baked into the term "news aggregator", although here current = "whatever was recently submitted", not necessarily = "news"), which doesn't really fit posts like this that stay relevant for longer. A subreddit might help, but even there things fall off after a while, regardless of the discussion status.

> The added inertia acts both as a smoother and an accelerator

    momentum = mass * velocity

    force = mass * acceleration

    momentum = (force / acceleration) * velocity

So, it looks to me like momentum is inversely related to acceleration. It doesn't seem right to call momentum an "accelerator".

  • Hi! This article is about momentum in the mathematical field of optimization. Acceleration also refers to a phenomenon in optimization. While there are deep connections to their physical analogues, they aren't quite the same thing.

    If you want to make your analogy work, the momentum algorithm adds mass to an object. In terms of literal acceleration at any point in time it is neutral, but the added mass causes you to get through difficult areas much faster.

    That said, much of this article is moving away from that physical analogy. :)

    • The use of the term "accelerator" is misleading as it's introduced during the physical metaphor. You should consider editing that sentence to clarify your meaning:

      The added inertia smooths out variation in velocity, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima. Keeping our speed steadier, we arrive at the global optimum faster.

      Note the alliteration :-)

      > momentum accelerates convergence

      Here it's more clear that momentum is accelerating convergence, not the "heavy ball" itself.

      > inertia acts both as a smoother and an accelerator

      On the second read, it's more clearly a contradiction. Speed can't be both held steadier and accelerated simultaneously. If you meant that momentum alternately smooths and accelerates, then it's even more strange. For that behavior, some sort of motor would be a more appropriate metaphor.

  • But velocity is the integral of acceleration, and for constant acceleration, this means

        velocity = acceleration*time
        momentum = force*time = mass*acceleration*time
    

    So momentum is actually proportional to acceleration in this case.

    • Hmm. Think of the response that acceleration would have, all else equal, if momentum increased.

      There's two ways to change momentum -- velocity or mass. If mass increases, acceleration decreases. If velocity increases, acceleration remains the same.

      Edit: You've got a bit of an endogeneity problem in your equation. And I guess I do, too.

          momentum = mass * acceleration * time
      

      Since momentum is equivalent to mass times velocity, one of its components is on both sides of the equation simultaneously. My original equation had the same problem, except with velocity instead of mass.

      So... I think "momentum" here is leading us a bit in the wrong direction. What the author should have been saying is "mass". After all, the metaphor was "a heavy ball", emphasizing the weight. The initial velocity was unchanged across the comparison.

      It bugs me that so many domains misuse the term "momentum". Business people confuse momentum for velocity. Now machine learning folks are confusing momentum for mass. It's common for people to use a component of the whole as a stand-in term, like "my wheels" instead of "my car". Less common to go the other way: "my arm hurts" instead of "my elbow hurts". In this machine learning case the imprecision of the metaphor harms understanding.

The math here is beyond my reasoning, but I loved playing with the sliders!

Turn the step size and momentum to maximum for some wonderfully glitchy chaos! (and a demonstration of why forward Euler integration only works well with relatively small step sizes)
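
You can see why with a one-dimensional quadratic (my own back-of-the-envelope, same flavor as the article's analysis): on f(w) = 0.5*lambda*w^2, plain gradient descent gives

    w[k+1] = w[k] - alpha*lambda*w[k] = (1 - alpha*lambda)*w[k]

so once alpha*lambda exceeds 2 the factor has magnitude greater than 1 and every step overshoots further than the last, which is exactly the glitchy blow-up you see at the maximum step size.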

Am I the only one who drew parallels to real life and how "just do stuff" often works better than the deliberate, slow step by step process?

In your polynomial regression example, I can't follow what you mean by p_i = \xi \rightarrow \xi^{i-1} when you're setting up the model.

  • I just mean p_i(\xi) = \xi^{i-1}. The notation is a little cumbersome here, but it pays off in the second equation.
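
    In code it's just a polynomial feature expansion, something like (my sketch, not the article's actual code):

        import numpy as np

        # p_i(xi) = xi**(i-1): the i-th feature is the (i-1)-th power of the
        # input, so w_1*p_1 + ... + w_n*p_n is a degree-(n-1) polynomial.
        def features(xi, n):
            return np.array([xi ** (i - 1) for i in range(1, n + 1)])

        def model(w, xi):
            return w @ features(xi, len(w))

        w = np.array([1.0, -2.0, 0.5])   # hypothetical weights
        print(model(w, 2.0))             # 1*1 - 2*2 + 0.5*4 = -1.0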

    • More polynomial regression questions :) Are you using the optimal step size you derived in the previous section? If so, why don't the first and last eigendirections converge at once? If not, doesn't it suggest that there's a trade-off between speed of convergence and the ability to stop early?

      In general, this is a nicely written article, though. Good work!

    • Sorry, got it. The fact that you introduce the data as \xi_i a line earlier might throw the reader a little.

It seems to offer no improvement for badly conditioned problems, i.e. k >> 1. The convergence rate approaches 1 with or without the damped momentum.

This looks like gradient descent passed through a Kalman filter. Which, on reflection, seems like a good idea for overcoming ripples.

Figured I'd mention since I saw the author and an editor in here: in footnote 4, the LaTeX isn't rendering (Chrome, OS X).

I was fully expecting an article about some braindead product that nobody needs, called Momentum. Imagine my surprise finding physics and a healthily low percentage of BS.

  • This title, and some others on the main page, made me realize that while a title might make sense in the context of its own site, we are collectively really bad at using titles that are appropriate for passing around on the internet. This article (like many others) could have been about just about anything based on the title alone.