Comment by gabrielgoh

9 years ago

Very good question! I have considered this issue too. This form of weighting is the kind used in Adam, and it is qualitatively different from the updates described here. The tools of analysis in this article can be used to understand that iteration as well (it amounts to a different R matrix), and I would be curious to see whether it also allows for a quadratic speedup.

[EDIT] As per halfling's comment, this is just a rescaling of the learning rate by a factor of (1-beta).
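
For anyone who wants to check the rescaling numerically, here is a minimal sketch. It assumes the weighting in question is the exponential-moving-average form z <- beta*z + (1-beta)*g, and that the article's update is z <- beta*z + g followed by w <- w - alpha*z, with z starting at zero in both cases. The toy quadratic, the function names, and the parameter values are all my own choices for illustration.

```python
def grad(w, lams=(1.0, 10.0)):
    # Gradient of a toy quadratic f(w) = 0.5 * sum(lam_i * w_i**2) (illustrative only).
    return [lam * wi for lam, wi in zip(lams, w)]

def momentum(w0, alpha, beta, steps):
    # Assumed article-style update: z <- beta*z + g ; w <- w - alpha*z
    w, z = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        g = grad(w)
        z = [beta * zi + gi for zi, gi in zip(z, g)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
    return w

def ema_momentum(w0, alpha, beta, steps):
    # Assumed Adam-style weighting: z <- beta*z + (1-beta)*g ; w <- w - alpha*z
    w, z = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        g = grad(w)
        z = [beta * zi + (1.0 - beta) * gi for zi, gi in zip(z, g)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
    return w

if __name__ == "__main__":
    w0, alpha, beta, steps = [1.0, 1.0], 0.1, 0.9, 50
    a = ema_momentum(w0, alpha, beta, steps)
    b = momentum(w0, alpha * (1.0 - beta), beta, steps)
    # The two runs should agree up to floating-point rounding.
    print(max(abs(x - y) for x, y in zip(a, b)))
```

The reason this works: with both buffers starting at zero, the EMA buffer is always (1-beta) times the plain momentum buffer, so the factor can be folded into the step size and the iterates w coincide.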