The Truth of Sisyphus
  • Introduction
  • Deep Learning
    • Basics
      • Hinge Loss
      • Regularizations
      • Linear Classification
      • Multi-Class and Cross Entropy Loss
      • Batch Norm and other Normalizations
      • Optimization
      • Optimization Functions
      • Convolution im2col
      • Activation Functions
      • Derivatives
        • Derivatives of Softmax
        • A Smooth (differentiable) Max Function
      • Model Ensemble
      • Layers Python Implementation
    • Classification
      • Mobile friendly networks
      • Non-local Neural Networks
      • Squeeze-and-Excitation Networks
      • Further Attention Utilization -- Efficience & Segmentation
      • Group Norm
      • ShuffleNet V2
    • Segmentation
      • Several Instance Segmentation
      • A Peek at Semantic Segmentation
      • Design Choices for Mobile Friendly Deep Learning Models, Semantic Segmentation
      • Efficient Video Object Segmentation via Network Modulation
      • BiSeNet
      • DeepLabV3+
    • Detection
      • CornerNet
      • IoU-Net
      • Why smooth L1 is popular in BBox Regression
      • MTCNN-NCNN
      • DetNet
      • SSD Illustration
    • RNN Related
      • GRU vs LSTM
      • BERT
    • Reinforcement Learning
      • AutoML in Practice Review
      • DRL for optimal execution of profolio transaction
    • Multi-task
      • Multi-task Overview
      • What are the tricks in Multi-Task network design?
    • Neural Network Interpretation
      • Neuron Visualization
    • Deep Learning Frameworks
      • How does Caffe work
      • [Gluon] When to use (Hybrid)Sequential and (Hybrid)Block
      • Gluon Hybrid Intro
      • Gluon HybridBlocks Walk-Through
      • A quick tour of Torch internals
      • NCHW / NHWC in Pytorch
      • Static & Dynamic Computation Graph
    • Converting Between DL Frameworks
      • Things To Be Considered When Doing Model Converting
      • Caffe to TensorFlow
    • Computation Graph Optimization
      • Two ways of TensorRT to optimize Neural Network Computation Graph
      • Customized Caffe Memory Optimization
      • NCNN Memory Optimization
      • Symbolic Programs Advantages: More Efficient, Reuse Intermediate Memory, Operation Folding
    • Deep Learning Debug
      • Problems caused by dead ReLU
      • Loss jumps to 87.3365
      • Common Causes of NANs During Training
    • Deployment
      • Efficient Convolution Operation
      • Quantization
    • What I read recently
      • Know Google the Paper Way
      • ECCV 2018
      • Neural Machine Translation
      • Street View OCR Extraction System
      • Teaching Machines to Draw
      • Pixel to Graph
      • Burst Image Deblurring
      • Material for Masses
      • Learning to Separate Object Sounds by Watching Unlabeled Video
    • Papers / Posts to be read
    • Dummy thoughts
  • Machine Learning
    • Classification
    • Regression
    • Clustering
    • Dimension Reduction
    • Metrics
    • Regularization
    • Bayesian Example
    • Machine Learning System Design
    • Recommendation
    • Essentials of Machine Learning
    • Linear Regression
    • Logistic Regression
      • Logistic Function
    • Gaussian Discriminant Analysis
    • Naive Bayes
    • SVM
    • MLE vs MAP
    • Boosting
    • Frequent Questions
    • Conclusion of Machine Learning
  • Python notes
    • Python _ or __ underscores usage
    • Python Multiprocess and Threading Differences
    • Heapq vs. Q.PriorityQueue
    • Python decorator
    • Understanding Python super()
    • @ property
    • Python __all__
    • Is Python List a Linked List or Array
    • What is the "u" in u'Hello world'
    • Python "self"
    • Python object and class
    • Python Class' Instance method, Class method, and Static Methods Demystified
    • Python WTF
    • Python find first value index in a list: [list].index(val)
    • Sort tuples, and lambda usecase
    • Reverse order of range()
    • Python check list is empty
    • Python get ASCII value from character
    • An A-Z of useful Python tricks
    • Python nested function variable scope
    • Python reverse a list
    • Python priority queue -- heapq
  • C++ Notes
    • Templates
    • std::string (C++) and char* (or c-string "string" for C)
    • C++ printf and cout
    • Class Member Function
    • Inline
    • Scope Resolution Operator ::
    • Constructor
    • Destructor
    • Garbage Collection is Critical
    • C++ Question Lists
  • Operating System
    • Basics
    • Mutex & Semaphore
    • Ticket Selling System
    • OS and Memory
    • Sort implementation in STL
    • Compile, link, loading & run
    • How to understand Multithreading and Multiprocessing from the view of Operating System
  • Linux & Productivity
    • Jupyter Notebook on Remote Server
    • Nividia-smi monitoring
  • Leetcode Notes
    • Array
      • 11. Container With Most Water
      • 35. Search Insert Position
    • Linked List
      • Difference between Linked List and Array
      • Linked List Insert
      • Design of Linked List
      • Two Pointers
        • 141. Linked List Cycle
        • 142. Linked List Cycle II
        • 160. Intersection of two Linked List
        • 19. Remove N-th node from the end of linked list
      • 206. Reverse Linked List
      • 203. Remove Linked List Elements
      • 328. Odd Even Linked List
      • 234. Palindrome Linked List
      • 21. Merge Two Sorted Lists
      • 430. Flatten a Multilevel Doubly Linked List
      • 430. Flatten a Multilevel Doubly Linked List
      • 708. Insert into a Cyclic Sorted List
      • 138. Copy List with Random Pointer
      • 61. Rotate List
    • Binary Tree
      • 144. Binary Tree Preorder Traversal
      • 94. Binary Tree Iterative In-order Traverse
    • Binary Search Tree
      • 98. Validate Binary Search Tree
      • 285. Inorder Successor in BST
      • 173. Binary Search Tree Iterator
      • 700. Search in a Binary Search Tree
      • 450. Delete Node in a BST
      • 701. Insert into a Binary Search Tree
      • Kth Largest Element in a Stream
      • Lowest Common Ancestor of a BST
      • Contain Duplicate III
      • Balanced BST
      • Convert Sorted Array to Binary Search Tree
    • Dynamic Programming
      • 198. House Robber
      • House Robber II
      • Unique Path
      • Unique Path II
      • Best time to buy and sell
      • Partition equal subset sum
      • Target Sum
      • Burst Ballons
    • DFS
      • Clone Graph
      • General Introduction
      • Array & String
      • Sliding Window
  • Quotes
    • Concert Violinist Joke
    • 船 Ship
    • What I cannot create, I do not understand
    • Set your course by the stars
    • To-do list
Powered by GitBook
On this page
  1. Deep Learning
  2. Basics

Optimization Functions

1) 拉格朗日算子的正负值是否无关?

2) SMO for SVM? 为什么要两维

3)Batch Gradient Descend 中 h()? 为什么乘以 x

4)Newton 法:是否是不停的求切线,二阶是怎么来的?【凸优化 OK, 非凸有问题?mini-batch 会累积噪声】 GD:梯度下降的方向

  • Newton:

    • 梯度切线下降的方向(梯度变化最快的方向)

    • 但要计算二阶导数 — Hessian size 很大,而且运算量巨大(是 O(n^2)吗?)

    • 计算步长:阻尼牛顿

  • Quasi-Newton:

    • 拟合 Hessian / Hessian 的逆 (O(n^2) -> O(2 * n))(用两个 vector dot product 代替 Hessian 的逆(H^-1))

    • BFGS: 拟合 Hessian (H)

    • L-BFGS:不存拟合的 Hessian (B)

5)Conjugate Gradient

  • 类似 coordinate 方法:先转换成几个共轭的方向,再在共轭的方向上寻找 coordinate 下降

6)1.2.3.4 中 一阶动量/二阶动量是什么

  • 一阶动量决定方向;lr/二阶动量 决定最终步长

7) SGD + Nesterov Gt 为什么用 m(t-1) V(t - 1) 算

8)鞍点:两个方向的 gradient 都为 0?还是一个方向是最低了?

9)AdaGrad 中:一阶平方和是二阶吗?

10) AMSGrad: 保证 Vt >= Vt-1, 这样就不会后期 lr 又因 Vt 变小而使 lr 变大而震荡

11)decouple Nadam:为什么 weight decay 会在 optimization function 中?

12)Gradient noise?paper 名字

13) Warm restart: 会更好吗?snapshot ensemble:比 model ensemble 节约很多训练时间(一轮下来的到好几个局部最优)

14)LSTM model 训练 lr (为什么 lstm)

15)learn to learn,学习模型结构

16)参数越多,局部极值点越多;评价 local minimum:如果这个 minimum 非常 sharp 就不好(有争议)。

17)GPU 并行:

  • Disk —> cpu —> 显存 —> GPU cores;GPU cores 运算的是 im2col 后的小 patch

18)model parallelism & data parallelism

PreviousOptimizationNextConvolution im2col

Last updated 6 years ago