Why Momentum Works

9 years ago (distill.pub)

I'm curious about the method chosen to give short term memory to the gradient. The most common way I've seen when people have a time sequence of values X[i] and they want to make a short term memory version Y[i] is to do something of this form:

  Y[i+1] = B * Y[i] + (1-B) * X[i+1]

where 0 <= B <= 1.

Note that if the sequence X becomes a constant after some point, the sequence Y will converge to that constant (as long as B != 1).

For giving the gradient short term memory, the article's approach is of the form:

  Y[i+1] = B * Y[i] + X[i+1]

  Note that if X becomes constant, Y converges to X/(1-B), as long as B is in [0,1).

Short term memory doesn't really seem to describe what this is doing. There is a memory effect in there, but there is also a multiplier effect when in regions where the input is not changing. So I'm curious how much of the improvement is from the memory effect, and how much from the multiplier effect? Does the more usual approach (the B and 1-B weighting as opposed to a B and 1 weighting) also help with gradient descent?
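
To make the multiplier effect concrete, here's a tiny toy script (my own sketch, not from the article) that feeds a constant input into both recurrences:

    # Compare the (B, 1-B) exponential average with the (B, 1) accumulation
    # used for momentum, on a constant input stream X[i] = 1.
    B = 0.9
    x = 1.0
    y_ema, y_mom = 0.0, 0.0
    for _ in range(200):
        y_ema = B * y_ema + (1 - B) * x   # converges to x
        y_mom = B * y_mom + x             # converges to x / (1 - B)
    print(y_ema)   # ~1.0
    print(y_mom)   # ~10.0

The (B, 1) form bakes in a 1/(1-B) gain on top of the smoothing, which is the multiplier effect I'm asking about.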

  • I assume that multiplying by a given factor shouldn't matter since you still have the learning rate as a factor (which is itself a factor of the gradient). This might just mean that the learning rate should be lower or higher with this method.

  • Very good question! I have considered this issue too. This form of weighting is the kind used in ADAM, and is qualitatively different from the updates described here. The tools of analysis in this article can be used to understand that iteration too (this amounts to a different R matrix), and I would be curious to see if it too allows for a quadratic speedup.

    [EDIT] As per halfling's comment, this is just a change of the learning rate by a factor of (1-beta); a quick check is below.
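
    Quick check of that claim (my own algebra, so take it with a grain of salt): if Y uses the (B, 1-B) weighting and Z uses the (B, 1) weighting on the same inputs, and both start at 0, then Y[i] = (1-B) * Z[i] for all i, since

        Y[i+1] = B * Y[i] + (1-B) * X[i+1]
               = B * (1-B) * Z[i] + (1-B) * X[i+1]
               = (1-B) * (B * Z[i] + X[i+1])
               = (1-B) * Z[i+1]

    so taking steps of size alpha along Y is the same as taking steps of size alpha*(1-B) along Z.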

It would be nice if the introduction made clear that the "Momentum" that "works" is some algorithm and not at all the physical concept "momentum".

  • At least it wasn't some random software project. My expectation was about 50/50 between "something to do with actual momentum" and "yet another Javascript framework".

  • distill.pub is a blog specifically for machine learning. While the poster could have added something to the title, the usual practice around here is to leave it as is.

  • Good point. I also had no idea what this is talking about. An introductory sentence about it in the article would have been good.

Reminds me of this paper, "Causal Entropic Forces":

http://math.mit.edu/~freer/papers/PhysRevLett_110-168702.pdf

I can't follow the math but the presentation is gorgeous (Safari on a MacBook retina display). Really great, keep up the good work!

  • not following the math is understandable since it only takes a little unfamiliarity to make math seem hard to understand.

    but if you've studied linear systems a bit, the math is laid out really well for understanding and turning into practical applications (which seems to be the whole point of distill.pub).

    many math articles, in contrast, seem to obfuscate the "why" and the "what for" and concentrate on the theory and derivation (e.g., many wikipedia math articles). that might be great for mathematical elegance (or cynically, one-upmanship), but not so great for people trying to build things using that valuable knowledge.

  • Which part of the math did you find difficult to follow?

    • Better question: What background is the reader expected to have?

      Until Xi, not a single variable is defined prior to inclusion into an expression. Even Alpha and Beta are only defined in the header diagram rather than the body of the text. Also, why are the iteration notations in the superscript rather than subscript?

      And before someone chimes in and says that rudiments are necessary to understand this work, no they aren't. The logical steps here are exceptionally simple (and intuitive - as the introduction might lead you to believe) once you get past the delivery. This is a fantastic article that could become very accessible with the proper notation housekeeping.

I have to say that the dynamics of momentum diagram is a thing of real beauty. The whole paper felt a little NYTimes-like, and then of course I see that Shan Carter helped a little bit with it!

But simply not on Firefox

  • Thanks for pointing that out! We fixed the diagram bug in Firefox. There's still bad performance for ~30s after page load -- we're looking into why that's happening -- but after that the page seems to work well.

    • Generating the nice formulas from TeX syntax seems to eat most of those 30s for me - maybe you could do that at build time instead of pushing that work to the client?

    • Cool that you fixed it already! I should have started with some praise: great selection of articles

Curious, has this method been used for solving linear systems? How would it perform e.g. against conjugate gradient?

And how would it perform for non-positive-definite systems?

  • Author here - yes! It can be used to solve symmetric PSD systems, as the solution of Ax = b is also the minimizer of 0.5*x'Ax - b'x. Conjugate gradient can be seen as a special form of momentum with adaptive alphas and betas.
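
    A minimal sketch of that idea (my own toy code, not from the article; alpha and beta are just hand-picked for this particular A):

        import numpy as np

        # Solve Ax = b for symmetric positive-definite A by minimizing
        # f(x) = 0.5*x'Ax - b'x with gradient descent plus momentum.
        A = np.array([[3.0, 1.0], [1.0, 2.0]])
        b = np.array([1.0, 1.0])

        x = np.zeros(2)
        z = np.zeros(2)            # accumulated gradient ("velocity")
        alpha, beta = 0.3, 0.5     # hand-tuned for this A
        for _ in range(100):
            grad = A @ x - b       # gradient of the quadratic
            z = beta * z + grad
            x = x - alpha * z

        print(x)                   # close to np.linalg.solve(A, b)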

    • > Conjugate gradient can be seen as a special form of momentum

      Just to be clear, though, CG doesn't use the negative gradients as search directions, as steepest descent would.

Some of the multi-author articles on Distill have a very important (IMO) innovation. They quantify precisely the contribution each author has made to the article. I would like to see this become the norm in scientific papers, so on the one hand it'd be clear who to ask questions if they arise, and on the other the various dignitaries won't get their honorary spot on the author lists of papers they were barely involved with scientifically.

  • As others mentioned, this is common in other fields, it's just not done in machine learning.

    We've been reading through the policies of lots of journals that seem thoughtfully run and borrowing good ideas. :P

  • Other journals, e.g. Nature, already had this. It's not a Distill innovation. But I agree, this should be the norm in journals/confs/etc.

For those that don't read materials about optimization as much as they maybe should, what is "w⋆"? It is used without introduction and I don't know what it is. Perhaps this is a convention I am not aware of?

I'm really loving the choice of articles, especially since you're just getting started.

Edit: I'm referring to the journal, not the author.

Pretty cool, it brings in stuff from classical control theory. I always kinda felt it was a missed opportunity that classical controls weren't really brought up in discussions of gradient descent.

People, listen. "Damping" means to reduce the amplitude of an oscillation. "Dampening" means to make something wetter.

</rant>

Hm. So that helps with high-frequency noise. Any progress on what to do when the dimensions are of vastly different scales? I have an old physics engine which had to solve a system of about 20 nonlinear differential equations. During a collision, the equations go stiff, and some dimensions may be 10 orders of magnitude steeper than others. Gradient descent then faces very steep knife edges. This is called a "stiff system" numerically.

  • Author here - I believe the problem of a "stiff system" you're referring to is exactly the problem of pathological curvature!

    Some points not touched on in the article: if the individual dimensions are of different scales, this problem can be easily fixed with a diagonal preconditioner (toy sketch below). Even something like ADAM or Adagrad (unconventional, I know, in this domain) can be used.

    There's also a small industry around more sophisticated preconditioners for the linear systems in PDEs, see Multigrid, for example, or preconditioned conjugate gradient.
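
    Toy illustration of the diagonal-preconditioner point (my own sketch, with made-up scales):

        import numpy as np

        # Gradient descent on f(x) = 0.5 * sum(s_i * x_i^2), where the
        # per-dimension curvatures s_i differ by many orders of magnitude.
        s = np.array([1e-5, 1.0, 1e5])
        grad = lambda x: s * x

        # Diagonal preconditioner: scale each gradient component by the
        # inverse of its curvature, so every dimension sees roughly unit scale.
        P_inv = 1.0 / s

        x = np.ones(3)
        alpha = 0.5
        for _ in range(50):
            x = x - alpha * P_inv * grad(x)   # preconditioned step

        print(x)   # all three coordinates shrink at the same rate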

    • The stiffness may be local. It definitely is in a physical simulation for hard collisions. Machine learning data is usually normalized into [0..1], so if you get a really steep slope, something is pathological.

  • I'm not an expert on anything covered in the article, but we have a similar physics-based model at my work (complex non-linear equations). We use a technique called Sequential Quadratic Programming (SQP) to find an optimal solution. My understanding is that this gives better results than using gradient descent but will only work if the functions are continuous.

    This could be worth looking into for you.
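
    If you're in Python land, SciPy's minimize exposes an SQP-flavored method (SLSQP) that's easy to experiment with; a toy constrained example (mine, obviously nothing like your physics model):

        import numpy as np
        from scipy.optimize import minimize

        # Minimize a smooth objective subject to a linear inequality constraint
        # using SciPy's SQP-style solver.
        objective = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2
        constraints = [{'type': 'ineq', 'fun': lambda x: x[0] - 2 * x[1] + 2}]

        result = minimize(objective, x0=np.array([2.0, 0.0]),
                          method='SLSQP', constraints=constraints)
        print(result.x, result.success)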

It would be really, really great if you could somehow hook this up to Discourse so people could comment on and ask questions about the article. Allowing people to ask questions and having others answer like MathOverflow would I think bring a lot more clarity. Many different kinds of people want to understand material like this but may need the math unpacked in different ways.

  • I don't see how Discourse would be better than something like HN or Reddit. There's also a submission to Reddit[1].

    With Discourse, I think there would be more noise and a lot of time would be wasted scrolling through unnecessary replies. What's good about the threaded nature of HN/Reddit is that you'll have proper context and the rating system does its job so everyone's time won't be wasted.

    Questions can be answered here and on Reddit too. I think Reddit can be sometimes more helpful than HN when it comes to answering 'easy' questions so I would go there if you have any simple questions.

    [1]: https://www.reddit.com/r/MachineLearning/comments/63f3uk/r_w...

    • I disagree. The most useful replies can be upvoted to deal with noise, and using mathematical formulas to help explain responses is absolutely crucial. I can't tell you how helpful it has been to be looking at an obscure proof, post a question to Math Overflow, and have the answer explained in an intuitive way with reference to the symbols and notation used.

      These articles on distill I believe could greatly benefit from this. Let the community help distill.

    • HN doesn't work if someone has a question in 2 weeks. Both HN and Reddit have an incredible skew towards current things (it's sort of baked into the term "news aggregator", although here current = "whatever was recently submitted", not necessarily = "news"), which doesn't really fit posts like this that stay relevant for longer. A subreddit might help, but even there things fall off after a while, regardless of the discussion status.

> The added inertia acts both as a smoother and an accelerator

    momentum = mass * velocity

    force = mass * acceleration

    momentum = (force / acceleration) * velocity

So, it looks to me like momentum is inversely related to acceleration. It doesn't seem right to call momentum an "accelerator".

  • Hi! This article is about momentum in the mathematical field of optimization. Acceleration also refers to a phenomenon in optimization. While there are deep connections to their physical analogues, they aren't quite the same thing.

    If you want to make your analogy work, the momentum algorithm adds mass to an object. In terms of literal acceleration at any point in time it is neutral, but the added mass causes you to get through difficult areas much faster.

    That said, much of this article is moving away from that physical analogy. :)

    • The use of the term "accelerator" is misleading as it's introduced during the physical metaphor. You should consider editing that sentence to clarify your meaning:

      The added inertia smooths out variation in velocity, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima. Keeping our speed steadier, we arrive at the global optimum faster.

      Note the alliteration :-)

      > momentum accelerates convergence

      Here it's more clear that momentum is accelerating convergence, not the "heavy ball" itself.

      > inertia acts both as a smoother and an accelerator

      On the second read, it's more clearly a contradiction. Speed can't be both held steadier and accelerated simultaneously. If you meant that momentum alternately smooths and accelerates, then it's even more strange. For that behavior, some sort of motor would be a more appropriate metaphor.

  • But velocity is the integral of acceleration, and for constant acceleration, this means

        velocity = acceleration*time
        momentum = force*time = mass*acceleration*time
    

    So momentum is actually proportional to acceleration in this case.

    • Hmm. Think of the response that acceleration would have, all else equal, if momentum increased.

      There's two ways to change momentum -- velocity or mass. If mass increases, acceleration decreases. If velocity increases, acceleration remains the same.

      Edit: You've got a bit of an endogeneity problem in your equation. And I guess I do, too.

          momentum = mass * acceleration * time
      

      Since momentum is equivalent to mass times velocity, one of its components is on both sides of the equation simultaneously. My original equation had the same problem, except with velocity instead of mass.

      So... I think "momentum" here is leading us a bit in the wrong direction. What the author should have been saying is "mass". After all, the metaphor was "a heavy ball", emphasizing the weight. The initial velocity was unchanged across the comparison.

      It bugs me that so many domains misuse the term "momentum". Business people confuse momentum for velocity. Now machine learning folks are confusing momentum for mass. It's common for people to use a component of the whole as a stand-in term, like "my wheels" instead of "my car". Less common to go the other way: "my arm hurts" instead of "my elbow hurts". In this machine learning case the imprecision of the metaphor harms understanding.

The math here is beyond my reasoning, but I loved playing with the sliders!

Turn the step size and momentum to maximum for some wonderfully glitchy chaos! (and a demonstration of why forward Euler integration only works well with relatively small step sizes)
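
You can see why with a one-dimensional quadratic (my own back-of-the-envelope, same flavor as the article's analysis): on f(w) = 0.5*lambda*w^2, plain gradient descent gives

    w[k+1] = w[k] - alpha*lambda*w[k] = (1 - alpha*lambda)*w[k]

so once alpha*lambda exceeds 2 the factor has magnitude greater than 1 and every step overshoots further than the last, which is exactly the glitchy blow-up you see at the maximum step size.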

Am I the only one who drew parallels to real life and how "just do stuff" often works better than the deliberate, slow step by step process?

In your polynomial regression example, I can't follow what you mean by p_i = \xi \rightarrow \xi^{i-1} when you're setting up the model.

  • I just mean p_i(\xi) = \xi^{i-1}. The notation is a little cumbersome here, but it pays off in the second equation.
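
    In code it's just a polynomial feature expansion, something like (my sketch, not the article's actual code):

        import numpy as np

        # p_i(xi) = xi**(i-1): the i-th feature is the (i-1)-th power of the
        # input, so w_1*p_1 + ... + w_n*p_n is a degree-(n-1) polynomial.
        def features(xi, n):
            return np.array([xi ** (i - 1) for i in range(1, n + 1)])

        def model(w, xi):
            return w @ features(xi, len(w))

        w = np.array([1.0, -2.0, 0.5])   # hypothetical weights
        print(model(w, 2.0))             # 1*1 - 2*2 + 0.5*4 = -1.0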

    • More polynomial regression questions :) Are you using the optimal step size you derived in the previous section? If so, why don't the first and last eigendirections converge at once? If not, doesn't it suggest that there's a trade-off between speed of convergence and the ability to stop early?

      In general, this is a nicely written article, though. Good work!

    • Sorry, got it. The fact that you introduce the data as \xi_i a line earlier might throw the reader a little.

It seems to offer no improvement for badly conditioned problems, i.e. k >> 1. The convergence rate approaches 1 with or without the damped momentum.

This looks like gradient descent passed through a Kalman filter. Which, on reflection, seems like a good idea for overcoming ripples.

Figured I'd mention since I saw the author and an editor in here: in footnote 4, the LaTeX isn't rendering (Chrome, OS X).

I was fully expecting an article about some braindead product that nobody needs, called Momentum. Imagine my surprise finding physics and a healthily low percentage of BS.

  • This title, and some others on the main page, made me realize that while a title might make sense in the context of its own site, we are collectively really bad at using titles that are appropriate for passing around on the internet. This article (like many others) could have been about just about anything based on the title alone.