
Comment by ekelsen

10 months ago

now try the same experiment in 1 billion dimensions.

It's unclear whether increasing the dimensionality is, in itself, a challenge, provided the objective function remains well-behaved with a unique global minimum -- like the somewhat problematic Rosenbrock test objectives used as examples in the article (which are not convex, but do have a single global minimum).

On the other hand, if the objective function is highly multimodal, with many "red herring" local minima, an optimiser that is very good at finding the nearest local minimum might do worse in practice at global optimisation than one that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.

I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D = 10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i = 1, ..., D, is noise sampled from N(0, 0.05).
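A minimal sketch of that setup, together with the default-parameter scipy runs (the RNG seed is arbitrary, 0.05 is treated as a standard deviation, and the convergence tolerances shown are scipy's defaults rather than anything tuned):

  import numpy as np
  from scipy.optimize import minimize, rosen, rosen_der

  D = 10_000
  x_star = np.ones(D)                      # rosen's unique global minimiser
  rng = np.random.default_rng(0)           # seed chosen arbitrarily
  x0 = x_star + rng.normal(0.0, 0.05, D)   # noise added to each element of x*

  # Default-parameter scipy runs; res.njev reports gradient evaluations
  # (available for CG and L-BFGS-B on recent scipy versions).
  for method in ("CG", "L-BFGS-B"):
      res = minimize(rosen, x0, jac=rosen_der, method=method)
      print(method, res.njev, res.fun)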

Repeating this over 100 trial problems, using the same initial guess x0 for all three methods within each trial, the average number of gradient evaluations required for convergence was:

  'cg': 248
  'l-bfgs-b': 40
  'm-001-99': 3337

All methods converged in 100 / 100 trials.

m-001-99 is gradient descent with momentum, using alpha=0.001 and beta=0.99. Setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg and l-bfgs-b with default parameters (again, under the hood both rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction at each iteration; they do not use momentum updates or a fixed step size). I used l-bfgs-b rather than bfgs to avoid maintaining the dense D x D approximation to the inverse Hessian.
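For reference, a minimal sketch of that momentum update with a fixed step size (the gradient-norm stopping rule and iteration cap below are illustrative, not exactly what I used):

  import numpy as np
  from scipy.optimize import rosen_der

  def momentum_descent(x0, alpha=0.001, beta=0.99, gtol=1e-5, max_iter=200_000):
      # Heavy-ball / momentum gradient descent with a fixed step size alpha.
      x = x0.copy()
      v = np.zeros_like(x)
      n_grad = 0
      for _ in range(max_iter):
          g = rosen_der(x)          # one gradient evaluation per iteration
          n_grad += 1
          if np.max(np.abs(g)) < gtol:
              break                 # converged (by this illustrative criterion)
          v = beta * v - alpha * g  # accumulate velocity
          x = x + v                 # take the momentum step
      return x, n_grad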

One point in momentum's favour was robustness to higher noise levels in the initial guess: if the noise used to define x0 is increased to N(0, 1), the cg and l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails in a lower fraction of trials, provided the fixed step size is set small enough -- though it still requires a very large number of gradient evaluations to converge.

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...

  • Perhaps you took my comment too literally. Try it on a real neural network; it doesn't work.