Optimization

Think of walking around a large valley, trying to find its bottom. Or think of teleporting around the valley, where each batch is the area we can see: if we could see far enough and clearly enough, we could just teleport to the best spot in view. But there is no way to see the optimum directly, so we use an iterative method that improves our position a little at each step.

  1. Random search: sample many sets of W and just try them. This works horribly and would take a lifetime to finish.

  2. Use the local geometry. Maybe we can't see the bottom of the valley, but we can use our feet to feel the slope of the ground and step in whichever direction takes us a bit further down.

    • So what is the slope? The slope is the derivative of the loss function.

    • Gradient: it gives you the direction of greatest increase of the target function, and the negative gradient gives the direction of greatest decrease. It is a vector with the same shape as the parameters, and each slot represents how much the loss changes if that parameter moves an infinitesimal amount in that direction.

    • Numerical & Analytic gradient: the Numerical is approximate, slow and easy to write; but the Analytic is exact, fast and error-prone.

    • Gradient check: always use the analytic gradient for training, but verify the implementation against the numerical gradient. This is called a gradient check (see the sketch after this list).
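A minimal numpy-style sketch of a gradient check, assuming a scalar-valued `loss_fn` and an `analytic_grad` function; these names and the tolerance are placeholders, not code from the lecture.

```python
import numpy as np

def numerical_gradient(loss_fn, w, h=1e-5):
    """Approximate the gradient of loss_fn at w with centered differences."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = w[idx]
        w[idx] = old + h
        loss_plus = loss_fn(w)
        w[idx] = old - h
        loss_minus = loss_fn(w)
        w[idx] = old                      # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad

def gradient_check(loss_fn, analytic_grad, w):
    """Compare analytic and numerical gradients via max relative error."""
    num = numerical_gradient(loss_fn, w)
    ana = analytic_grad(w)
    rel_err = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
    return rel_err.max()

# Usage sketch on f(w) = sum(w**2), whose analytic gradient is 2*w.
w = np.random.randn(3, 4)
print(gradient_check(lambda x: np.sum(x**2), lambda x: 2 * x, w))  # expect ~1e-7 or smaller
```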

Vanilla SGD problems (the plain update itself is sketched after this list):

  1. Taco-shaped loss surface: zigzag along the steep dimension, slow progress along the shallow dimension.

  2. Local minima and saddle points: we get stuck.

In a one-dimensional problem it looks like local minima are a bigger problem than saddle points, but in practice, with millions of parameters, it is the opposite. With 100 million parameters, a saddle point means that around the current position the loss goes up along some directions and down along others, which is quite normal in 100 million dimensions. A local minimum means that the loss goes up along every one of the 100 million directions, which is pretty rare. So when you train a neural network, the trouble comes more from saddle points than from local minima.

Being exactly at a saddle point is not the only problem: near a saddle point the gradient isn't exactly zero, but it is quite small, so we make very, very slow progress. This is actually a big problem.

  3. Estimating the gradient from mini-batches can be noisy.

We are not getting the true full gradient at each time step, just a noisy estimate of it at the current point.
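For reference, the vanilla SGD referred to above is just the plain update below; a minimal numpy-style sketch where `grad_fn`, `lr`, and the toy quadratic are placeholders.

```python
import numpy as np

def sgd(grad_fn, w, lr=1e-2, num_steps=100):
    """Vanilla SGD: repeatedly step along the negative gradient, no extra state."""
    for _ in range(num_steps):
        grad = grad_fn(w)        # in practice, a noisy mini-batch gradient
        w = w - lr * grad        # plain gradient step
    return w

# Toy usage: minimize the taco-shaped quadratic f(w) = 0.5*(w0^2 + 20*w1^2).
w_opt = sgd(lambda w: np.array([w[0], 20.0 * w[1]]), np.array([1.0, 1.0]))
```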

SGD with Momentum

For a taco-shaped loss surface, the raw gradient is quite small in the shallow direction, but since SGD + Momentum updates the velocity as v_t = rho * v_{t-1} + grad_t, the velocity along that direction keeps accumulating and can become larger than the raw gradient.

For noisy gradients, the momentum term helps the zigzagging components of successive gradients cancel each other out.

At a local minimum or saddle point, even if the gradient is zero, the velocity we have built up keeps the ball (the current position) rolling.

Intuitively, the velocity is a weighted sum of the gradients you've seen over time, with more recent gradients weighted more heavily. It can be thought of as a smoothed moving average of your recent gradients, with an exponential decay on gradients further back in time.
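A minimal numpy-style sketch of the momentum update described above; `grad_fn`, `lr`, and `rho` are placeholder names, with rho (the friction/decay factor) typically around 0.9.

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=1e-2, rho=0.9, num_steps=100):
    """SGD + Momentum: keep a running velocity and step along it."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        grad = grad_fn(w)
        v = rho * v + grad       # exponentially decaying sum of past gradients
        w = w - lr * v           # step along the velocity, not the raw gradient
    return w
```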

Nesterov Momentum

Normally, if you have a loss function, you want to evaluate the loss and the gradient at the same point. The original form of Nesterov is annoying here because it asks for the gradient at the look-ahead point rather than at the current parameters.

For implementation, we change variables so that the parameter vector we actually store is always the look-ahead version.

Looking more closely, when we modify the current parameters we add the current velocity plus a weighted difference between the current and previous velocity. So Nesterov momentum incorporates a kind of error-correcting term between the current and previous velocity.
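A minimal numpy-style sketch of the reparameterized Nesterov update (the "always store the look-ahead version" trick described above); `grad_fn`, `lr`, and `rho` are placeholder names.

```python
import numpy as np

def nesterov_momentum(grad_fn, w, lr=1e-2, rho=0.9, num_steps=100):
    """Nesterov momentum in the look-ahead parameterization."""
    v = np.zeros_like(w)
    for _ in range(num_steps):
        grad = grad_fn(w)                 # gradient at the stored (look-ahead) point
        v_prev = v
        v = rho * v - lr * grad
        w = w + v + rho * (v - v_prev)    # current velocity + error-correcting term
    return w
```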

If there is a very sharp local minimum, won't SGD momentum and Nesterov overshoot and miss it?

In fact, if a very sharp minimum exists, we probably don't want it anyway. Imagine two minima, one flat and one sharp: the sharp one may be overfitting the current dataset. If we doubled the training set, the sharp minimum would likely no longer give good test results, while the flat one still would. So when choosing among local minima, we actually prefer the flatter ones, and overshooting sharp minima under momentum is really a feature rather than a bug.

AdaGrad

The idea is that if we have two directions, one that always has small gradients and one that always has large gradients, the small-gradient direction gets divided by a small number, so we accelerate movement along it, while the large-gradient direction gets divided by a large number, so we slow down along that wiggling direction.

Dividing by the accumulated squared gradients results in smaller and smaller step sizes. In the convex case this is actually really good, because as we approach the basin we want the update steps to become small. But in the non-convex case, if we meet a saddle point, this might get us stuck, which is not good.
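A minimal numpy-style sketch of AdaGrad as described above; `grad_fn`, `lr`, and `eps` are placeholder names, with eps a small constant to avoid division by zero.

```python
import numpy as np

def adagrad(grad_fn, w, lr=1e-2, eps=1e-8, num_steps=100):
    """AdaGrad: scale each dimension by the accumulated squared gradients."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        grad = grad_fn(w)
        grad_squared += grad * grad                       # keeps growing, never decays
        w = w - lr * grad / (np.sqrt(grad_squared) + eps)
    return w
```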

RMSProp

It's like momentum on the squared gradients rather than on the gradients themselves.

Because the accumulated squared gradients are leaky (they decay over time), this largely solves AdaGrad's problem of always slowing down, even when you might not want to.
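A minimal numpy-style sketch of RMSProp; `grad_fn`, `lr`, `decay_rate`, and `eps` are placeholder names, with decay_rate typically around 0.99.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=1e-3, decay_rate=0.99, eps=1e-8, num_steps=100):
    """RMSProp: a leaky (exponentially decaying) average of squared gradients."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        grad = grad_fn(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
        w = w - lr * grad / (np.sqrt(grad_squared) + eps)
    return w
```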

Adam

Problem: at the first step, the previous first and second moments are zero, so the current first moment is (1 - beta1) * dx and the current second moment is (1 - beta2) * dx * dx, both very small since beta1 and beta2 are close to 1 (typically 0.9 and 0.999). Dividing by the square root of that tiny second moment can produce a very large first step; in a bad situation your initialization gets completely messed up, you end up in a very bad part of the objective landscape, and you just cannot converge back. This is what Adam's bias-correction terms address.

If we actually plot these optimization algorithms, you can see that Adam combines elements of both SGD momentum and RMSProp. It overshoots like SGD momentum, though not quite as much, and like RMSProp it tends to curve so as to make roughly equal progress along all dimensions.
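A minimal numpy-style sketch of Adam with the bias-correction terms that fix the small-first-step problem above; `grad_fn` is a placeholder, and the hyperparameter defaults are the commonly used values rather than anything from the lecture.

```python
import numpy as np

def adam(grad_fn, w, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=100):
    """Adam: momentum on the gradient plus an RMSProp-style second moment."""
    m = np.zeros_like(w)   # first moment (moving average of gradients)
    v = np.zeros_like(w)   # second moment (moving average of squared gradients)
    for t in range(1, num_steps + 1):
        grad = grad_fn(w)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)   # bias correction: undo the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```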

Second Order Optimization

Newton method

The Hessian has O(N^2) elements and inverting it takes O(N^3). N is usually tens or hundreds of millions of parameters, and the square of hundreds of millions is far too big to fit in memory.
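A minimal numpy-style sketch of a single Newton step on a tiny toy problem, just to show the update w_new = w - H^{-1} grad; as argued above, N in deep learning is far too large for this to be practical.

```python
import numpy as np

def newton_step(w, grad, hessian):
    """One Newton update: w - H^{-1} grad (solve a linear system instead of inverting)."""
    return w - np.linalg.solve(hessian, grad)

# Toy usage on f(w) = 0.5 * w^T A w - b^T w, whose gradient is A w - b and Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)
w = newton_step(w, A @ w - b, A)   # lands on the exact minimum for a quadratic
```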

Quasi-Newton method

Works with an approximation to the inverse Hessian instead of forming and inverting the full Hessian.

L-BFGS

Does not form/store the full inverse Hessian.

It works well in the full-batch setting but not very well in the mini-batch setting.

Sometimes used in Style Transfer
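As a hedged illustration of full-batch L-BFGS in practice, SciPy exposes an L-BFGS variant through scipy.optimize.minimize; the toy quadratic objective here is just a stand-in, not anything from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Full-batch objective and gradient for a toy quadratic f(w) = 0.5 * ||A w - b||^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
loss = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)

result = minimize(loss, x0=np.zeros(2), jac=grad, method='L-BFGS-B')
print(result.x)   # the minimizer found by L-BFGS-B
```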
