Optimization

Think of optimization as walking around a large valley and trying to find its bottom. Each mini-batch is like teleporting somewhere in the valley and only seeing the area around that spot; if we could see far enough and clearly enough, we could simply teleport to the best spot in view. In practice it is very hard to see the optimal path directly, so we use an iterative method that improves the position a little at every step.

  1. Random search: keep a set of weights W and just try random ones. This would be a horrible strategy and would take a lifetime to finish.

  2. Use the local geometry. We may not be able to see the bottom of the valley directly, but we can use our feet to feel the slope of the ground and step in whichever direction takes us a bit further down.

    • So what is the slope? The slope is the derivative of the loss function.

    • Gradient: it gives the direction of greatest increase of the target function, and the negative gradient gives the direction of greatest decrease. It is a vector with the same shape as the parameters, and each slot tells how much the loss changes if the corresponding parameter moves an infinitesimal amount in that direction.

    • Numerical & analytic gradient: the numerical gradient is approximate, slow, and easy to write; the analytic gradient is exact and fast, but error-prone to implement.

    • Gradient check: always use the analytic gradient for training, but verify the implementation against the numerical gradient. This is called a gradient check; a sketch follows this list.
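
A minimal gradient-check sketch in NumPy, using a toy quadratic loss whose analytic gradient is known (the loss, shapes, and error threshold here are illustrative, not from the original notes):

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered-difference estimate of df/dw, one coordinate at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + h
        f_plus = f(w)
        w[idx] = old - h
        f_minus = f(w)
        w[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy loss with a known analytic gradient: L(w) = sum(w^2), so dL/dw = 2w.
w = np.random.randn(3, 4)
analytic = 2 * w
numeric = numerical_gradient(lambda w: np.sum(w ** 2), w)

# Relative error should be tiny (around 1e-7 or less) if the analytic gradient is correct.
rel_err = np.max(np.abs(analytic - numeric) /
                 (np.abs(analytic) + np.abs(numeric) + 1e-12))
print(rel_err)
```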

Vanilla SGD problems:

  1. Taco-shaped loss surface: zigzagging along the steep dimension, slow progress along the shallow dimension.

2. Local minima and saddle points: the update can get stuck.

For a one-dimensional problem, local minima look like a bigger problem than saddle points, but in practice, with millions of parameters, it is the opposite. With 100 million parameters, a saddle point means that around the current position the loss curves up along some directions and down along others, which is quite common in 100 million dimensions. A local minimum means the loss goes up along every one of the 100 million directions, which is pretty rare. So when training a neural network, the trouble comes more from saddle points than from local minima.

Being exactly at a saddle point is not the only problem: near a saddle point the gradient is not exactly zero, but it is very small, so progress becomes very, very slow. This is actually a big problem.

3. Gradients estimated from mini-batches are noisy.

At each step we are not getting the true gradient of the full objective, only a noisy estimate at the current point.
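
For reference, a minimal sketch of the vanilla SGD loop; `grad_fn` and `batches` are hypothetical placeholders for the mini-batch gradient computation and the data iterator:

```python
import numpy as np

def sgd(w, grad_fn, batches, lr=1e-2):
    """Vanilla mini-batch SGD: step along the negative (noisy) gradient."""
    for batch in batches:
        dw = grad_fn(w, batch)   # noisy estimate of the true gradient
        w = w - lr * dw
    return w
```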

SGD with Momentum

For a taco-shaped loss surface, the gradient is quite small along the shallow (horizontal) direction, but since SGD + momentum updates the velocity as v_t = rho * v_{t-1} + grad_t, the velocity along that direction keeps accumulating and can grow larger than the raw gradient.

For noisy gradients, the momentum term helps the zigzagging components of successive gradients cancel each other out.

At a local minimum or saddle point, even though the gradient is zero, the velocity we have built up keeps the ball (the current position) rolling downhill.

Intuitively, the velocity is a weighted sum of the gradients seen over time, with more recent gradients weighted more heavily. It can be thought of as a smooth moving average of recent gradients, with weights that decay exponentially as you go back in time.
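
A minimal sketch of the SGD + momentum update described above; `grad_fn` and `batches` are again hypothetical placeholders:

```python
import numpy as np

def sgd_momentum(w, grad_fn, batches, lr=1e-2, rho=0.9):
    v = np.zeros_like(w)
    for batch in batches:
        dw = grad_fn(w, batch)
        v = rho * v + dw     # exponentially decaying running sum of gradients
        w = w - lr * v       # step along the velocity, not the raw gradient
    return w
```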

Nesterov Momentum

Normally, given a loss function, you want to compute the loss and the gradient at the same point. The original form of Nesterov momentum evaluates the gradient at the look-ahead point, which makes computing the two together annoying.

For the implementation, the parameter vector we actually store is always the look-ahead version.

Looking more closely, when we modify the current parameters we add the current velocity plus a weighted difference between the current and previous velocity. So Nesterov momentum incorporates a kind of error-correcting term between the current and previous velocity.
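
A minimal sketch of this rearranged ("look-ahead") form, with `grad_fn` and `batches` as hypothetical placeholders; the update adds the current velocity plus a weighted difference between the current and previous velocity:

```python
import numpy as np

def nesterov_momentum(w, grad_fn, batches, lr=1e-2, rho=0.9):
    v = np.zeros_like(w)
    for batch in batches:
        v_prev = v
        v = rho * v - lr * grad_fn(w, batch)
        # current velocity plus a weighted (current - previous) velocity term
        w = w + v + rho * (v - v_prev)
    return w
```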

If there is a very sharp local minimum, won't SGD with momentum and Nesterov momentum overshoot and miss it?

In fact, if a very sharp minimum exists, we probably do not want it anyway. Imagine two minima, one flat and one sharp: the sharp one has likely overfit the current dataset. If we doubled the training set, the sharp minimum would probably no longer give good test results, while the flat one still would. So when choosing among local minima, we actually prefer the flatter ones, and overshooting sharp minima under momentum is a feature rather than a bug.

AdaGrad

The idea is that if we have two directions, one with consistently small gradients and one with consistently large gradients, the small-gradient direction gets divided by a small accumulated value, so movement along it accelerates, while the large-gradient direction gets divided by a large accumulated value, so movement along that wiggling direction slows down.

Dividing by the accumulated squared gradients makes the step size smaller and smaller. In the convex case this is actually good, because as we approach the basin we want the update steps to be small. In the non-convex case, however, hitting a saddle point can leave us stuck, which is not good.
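
A sketch of the AdaGrad update under the same placeholder conventions (`grad_fn`, `batches`); `eps` just avoids division by zero:

```python
import numpy as np

def adagrad(w, grad_fn, batches, lr=1e-2, eps=1e-7):
    grad_squared = np.zeros_like(w)
    for batch in batches:
        dw = grad_fn(w, batch)
        grad_squared += dw * dw                          # accumulates forever, never decays
        w = w - lr * dw / (np.sqrt(grad_squared) + eps)  # per-parameter step scaling
    return w
```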

RMSProp

It is like applying momentum to the squared gradients rather than to the gradients themselves.

Because the accumulated squared gradients are leaky, this largely solves AdaGrad's problem of always slowing down, even when you do not want it to.
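
The same sketch with the leaky accumulation (again, `grad_fn` and `batches` are placeholders):

```python
import numpy as np

def rmsprop(w, grad_fn, batches, lr=1e-3, decay_rate=0.99, eps=1e-7):
    grad_squared = np.zeros_like(w)
    for batch in batches:
        dw = grad_fn(w, batch)
        # leaky: old squared gradients decay instead of piling up as in AdaGrad
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
        w = w - lr * dw / (np.sqrt(grad_squared) + eps)
    return w
```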

Adam

Problem: at the first step, the previous first and second moments are zero, so the current first moment is (1 - beta1) * dx and the current second moment is (1 - beta2) * dx * dx, both of which are very small (with the usual beta1 = 0.9 and beta2 = 0.999). Dividing by the square root of the tiny second moment can produce a very large first step; in a bad case your initialization gets thrown into a very bad part of the objective landscape and training cannot converge back.
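
The full Adam update addresses this with bias-correction terms. A sketch, using the commonly cited defaults for beta1, beta2, and eps, and the same `grad_fn` / `batches` placeholders:

```python
import numpy as np

def adam(w, grad_fn, batches, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)   # first moment (momentum-like)
    v = np.zeros_like(w)   # second moment (RMSProp-like squared gradients)
    for t, batch in enumerate(batches, start=1):
        dw = grad_fn(w, batch)
        m = beta1 * m + (1 - beta1) * dw
        v = beta2 * v + (1 - beta2) * dw * dw
        # bias correction: without it, m and v are near zero at t = 1 and the
        # first step m / sqrt(v) can be disastrously large
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```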

If we actually plot these optimizers' trajectories, you can see that Adam combines elements of both SGD with momentum and RMSProp. It overshoots like SGD with momentum, though not quite as much, and like RMSProp it curves so as to make roughly equal progress along all dimensions.

Second Order Optimization

Newton's method

The Hessian has O(N^2) elements and inverting it takes O(N^3), where N is usually in the tens or hundreds of millions of parameters. A matrix with (hundreds of millions)^2 entries is far too big to fit in memory.
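
For comparison, a single Newton step on a toy quadratic (the function, names, and numbers here are purely illustrative); the linear solve is exactly the O(N^3) bottleneck described above:

```python
import numpy as np

def newton_step(w, grad, hessian):
    """One Newton update: w_next = w - H^{-1} grad, via a solve instead of an explicit inverse."""
    return w - np.linalg.solve(hessian, grad)

# Toy quadratic f(w) = 0.5 * w.T @ A @ w - b @ w, minimized in a single Newton step.
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive-definite Hessian
b = np.array([1.0, -2.0])
w = np.zeros(2)
w = newton_step(w, A @ w - b, A)          # gradient of the quadratic is A @ w - b
print(w)                                  # equals the exact minimizer, solve(A, b)
```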

Quasi-Newton method

It approximates the inverse of the Hessian instead of computing it exactly.

L-BFGS

It does not form or store the full inverse Hessian.

It works well in the full-batch, deterministic setting, but not very well with mini-batches.

It is sometimes used in style transfer.

Figure: x-axis is a single parameter, y-axis is the loss value. Top right: a local minimum; bottom right: a saddle point. Both have zero gradient.
Figure: the two forms of Nesterov momentum.