
MLE vs MAP


Author: 夏飞. Original post: https://zhuanlan.zhihu.com/p/32480810 (Zhihu).

A common choice of prior is $N(0, \sigma^2 I)$. It biases the parameters toward smaller norms and therefore reduces overfitting (e.g., in Bayesian logistic regression).

TLDR (or the take away)

  • Frequentist school - Maximum Likelihood Estimation (MLE)

  • Bayesian school - Maximum A Posteriori estimation (MAP)

Abstract

Virtually every problem in modern machine learning is ultimately reduced to optimizing an objective function, and MLE and MAP are the fundamental ideas from which such objectives are derived, so a solid understanding of both matters a great deal. In this post we take a careful look at these two estimators.

Controversy between two groups

Abstractly speaking, the frequentist and Bayesian schools hold fundamentally different views of the world. Frequentists believe the world is deterministic: there is a single underlying truth whose value is fixed, and our goal is to find that true value or the range in which it lies. Bayesians believe the world is uncertain: we start with a prior judgment about it and then adjust that judgment as data are observed, and our goal is to find the probability distribution that best describes the world.

When modeling a phenomenon, let $\theta$ denote the model's parameters; note that solving the problem essentially amounts to finding $\theta$. Then:

(1) Frequentists: there exists a unique true value of $\theta$. Take a simple, intuitive example, coin flipping, and let $P(head)$ denote the coin's bias. We flip a coin 100 times, 20 flips come up heads, and we want to estimate the bias $P(head) = \theta$. To a frequentist, $\theta = 20 / 100 = 0.2$, which is very intuitive. As the amount of data goes to infinity this approach gives an accurate estimate, but when data are scarce it can be severely biased. For example, take a fair coin, i.e. $\theta = 0.5$, flip it 5 times, and suppose all 5 flips come up heads (the probability of this is $1/2^5 = 3.125\%$); a frequentist would flatly estimate $\theta = 1$ for this coin, a serious error.

(2) Bayesians: $\theta$ is a random variable that follows some probability distribution. The Bayesian school has two inputs and one output: the inputs are the prior and the likelihood, and the output is the posterior. The prior, $P(\theta)$, is our judgment about $\theta$ before observing any data; for example, handed a coin, a reasonable prior is that it is fair with high probability and biased with low probability. The likelihood, $P(X|\theta)$, says what the observed data should look like given a known $\theta$. The posterior, $P(\theta|X)$, is the resulting distribution over the parameters. Bayesian estimation is built on Bayes' rule:

$$P(\theta|X) = \frac{P(X|\theta) \, P(\theta)}{P(X)}$$

Two points here are worth noting (both are illustrated in the sketch below):

  • As the amount of data grows, the posterior leans more and more on the data and the influence of the prior shrinks.

  • If the prior is a uniform distribution, the Bayesian approach is equivalent to the frequentist one; intuitively, a uniform prior expresses no prior judgment about the phenomenon at all.
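As a concrete illustration, here is a minimal Python sketch (my addition, not from the original post) contrasting the frequentist estimate with the Bayesian posterior mode for the coin example. It assumes a Beta prior, a standard choice here because it is conjugate to the Bernoulli likelihood, so the posterior is again a Beta distribution; the function names are mine.

```python
def frequentist_estimate(heads, flips):
    """MLE: the empirical frequency heads / flips."""
    return heads / flips

def beta_posterior_mode(heads, flips, a=5.0, b=5.0):
    """MAP under a Beta(a, b) prior; Beta(5, 5) peaks at 0.5 ("probably fair").

    The posterior is Beta(a + heads, b + tails); below is its mode.
    """
    tails = flips - heads
    return (a + heads - 1) / (a + heads + b + tails - 2)

# 5 flips, 5 heads: the MLE jumps to 1.0, the posterior mode stays moderate.
print(frequentist_estimate(5, 5))            # 1.0
print(beta_posterior_mode(5, 5))             # ~0.692, between 0.5 and 1

# A uniform prior Beta(1, 1) makes MAP coincide with the MLE.
print(beta_posterior_mode(5, 5, a=1, b=1))   # 1.0

# With a lot of data the prior's influence fades and the two agree.
print(frequentist_estimate(2000, 10000))     # 0.2
print(beta_posterior_mode(2000, 10000))      # ~0.2002
```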

MLE - Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the estimation method favored by the frequentist school!

Suppose the data $x_1, x_2, ..., x_n$ are an i.i.d. sample, and write $X = (x_1, x_2, ..., x_n)$, where i.i.d. stands for independent and identically distributed. Then the MLE of $\theta$ can be derived as follows:

$$\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max P(X; \theta) \\
&= \arg\max P(x_1; \theta) P(x_2; \theta) \cdots P(x_n; \theta) \\
&= \arg\max \log \prod_{i=1}^{n} P(x_i; \theta) \\
&= \arg\max \sum_{i=1}^{n} \log P(x_i; \theta) \\
&= \arg\min - \sum_{i=1}^{n} \log P(x_i; \theta)
\end{aligned}$$

The function optimized in the last line is called the Negative Log Likelihood (NLL); this concept, and the derivation above, are very important!

We often use MLE without even noticing, for example:

  • The frequentist coin-probability estimate earlier in this post is, at heart, obtained by optimizing the NLL; the appendix at the end of this post explains exactly why. :-)

  • Given some data, when we fit a Gaussian by computing the sample mean and variance and plugging them into the Gaussian formula, the theoretical justification is NLL optimization (see the sketch after this list).

  • The cross entropy loss used for classification in deep learning is, in essence, also MLE.
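To make the second bullet concrete, here is a small sketch (my addition, assuming numpy and scipy are available) that minimizes the Gaussian NLL numerically and checks that the minimizer matches the sample mean and the $1/n$ sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)

def gaussian_nll(params, x):
    """Negative log likelihood of x under N(mu, sigma^2)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)          # parameterize log(sigma) so sigma > 0
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(gaussian_nll, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The NLL minimizer reproduces the closed-form sample statistics,
# i.e. the "plug in mean and variance" recipe is exactly MLE.
print(mu_hat, data.mean())     # ~3.0, ~3.0
print(sigma_hat, data.std())   # ~2.0, ~2.0  (std with 1/n, the MLE variance)
```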

MAP - Maximum A Posteriori

Maximum A Posteriori (MAP) estimation is the method favored by the Bayesian school!

Back to the coin-flipping example: flip a fair coin 5 times and get 5 heads. If the prior says the coin is very likely fair (for instance a Beta distribution peaking at 0.5), then the posterior $P(\theta|X)$ is a distribution whose maximum lies between 0.5 and 1, rather than the blunt $\theta = 1$.

As before, suppose the data $x_1, x_2, ..., x_n$ are an i.i.d. sample and $X = (x_1, x_2, ..., x_n)$. Then the MAP estimate of $\theta$ can be derived as follows:

$$\begin{aligned}
\hat{\theta}_{MAP} &= \arg\max P(\theta | X) \\
&= \arg\min - \log P(\theta | X) \\
&= \arg\min - \log P(X|\theta) - \log P(\theta) + \log P(X) \\
&= \arg\min - \log P(X|\theta) - \log P(\theta)
\end{aligned}$$

Bayes' theorem is used in going from the second line to the third, and $\log P(X)$ can be dropped in going from the third line to the fourth because it does not depend on $\theta$. Notice that $-\log P(X|\theta)$ is exactly the NLL, so at optimization time the only difference between MLE and MAP is the prior term $-\log P(\theta)$. Now let us examine that prior term. Assume the prior is a Gaussian distribution, i.e.

$$P(\theta) = \text{constant} \times e^{-\frac{\theta^2}{2\sigma^2}}$$

Then $-\log P(\theta) = \text{constant} + \frac{\theta^2}{2\sigma^2}$. And here something remarkable happens: using a Gaussian prior in MAP is equivalent to using L2 regularization in MLE!
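Below is a minimal numerical sketch of that equivalence (my addition, not from the original post) for the simplest case: estimating a Gaussian mean with known unit variance under a prior $\mu \sim N(0, \sigma_p^2)$. Adding the $\mu^2 / (2\sigma_p^2)$ penalty to the NLL shrinks the estimate toward 0, matching the closed-form MAP solution.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=20)   # small sample, known sigma = 1
sigma_p = 1.0                                    # prior: mu ~ N(0, sigma_p^2)

# Evaluate NLL(mu) and NLL(mu) + mu^2 / (2 sigma_p^2) on a grid of mu values.
grid = np.linspace(-1.0, 4.0, 100_001)
nll_vals = 0.5 * ((data[None, :] - grid[:, None]) ** 2).sum(axis=1)
map_vals = nll_vals + grid ** 2 / (2 * sigma_p ** 2)

mu_mle = grid[np.argmin(nll_vals)]
mu_map = grid[np.argmin(map_vals)]

# Closed forms: the MLE is the sample mean; the Gaussian prior shrinks it.
print(mu_mle, data.mean())                                   # equal (up to grid step)
print(mu_map, data.sum() / (len(data) + 1 / sigma_p ** 2))   # equal, pulled toward 0
```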

A couple of additional remarks:

  • In our college probability courses, most of us were mainly taught the frequentist way of thinking; in fact, the Bayesian school of thought is also very popular, and highly practical.

Conclusion

Some students say: "There is no point learning this; nobody uses it anymore." That view is wrong: these are ideas people use all the time, they are the core of how objective functions are derived, and objective functions are in turn one of the cores of machine learning (deep learning included). Holding such a view suggests an insufficient understanding of what machine learning really is, and what surprised me was how many other students upvoted it. It left me a little disheartened, and it also became my motivation for writing this article. I hope it helps a few of you. :-)

Many professors at CMU like to approach problems with Bayesian thinking; my undergraduate advisor, Prof. 朱军 (Jun Zhu), also works on Bayesian deep learning. Do check it out if you are interested.

Reference

[1] Bayesian Method Lecture, UT Dallas.

[2] MLE, MAP, Bayes classification Lecture, CMU.

Appendix

Why is the frequentist method for estimating the coin probability essentially optimizing the NLL?

Because a coin flip can be modeled as a Bernoulli distribution with parameter $\theta$:

$$P(x_i; \theta) = \begin{cases} \theta, & x_i = 1 \\ 1 - \theta, & x_i = 0 \end{cases}$$

where $x_i = 1$ means the $i$-th flip comes up heads. Then

$$NLL = - \sum_{i=1}^{n} \log P(x_i; \theta) = - \sum_{i=1}^{n} \bigl[ x_i \log \theta + (1 - x_i) \log (1 - \theta) \bigr]$$

Taking the derivative and setting it to zero, we obtain

$$\hat{\theta} = \frac{\sum_{i=1}^{n} x_i}{n}$$

that is, the number of heads divided by the total number of flips.
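As a quick sanity check (my addition, assuming sympy is installed), the same stationary point can be recovered symbolically:

```python
import sympy as sp

theta, n, h = sp.symbols('theta n h', positive=True)

# Bernoulli NLL for h heads out of n flips (h is the sum of the x_i).
nll = -(h * sp.log(theta) + (n - h) * sp.log(1 - theta))

# Setting the derivative to zero recovers heads / total flips.
print(sp.solve(sp.Eq(sp.diff(nll, theta), 0), theta))   # [h/n]
```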
