The Truth of Sisyphus
  • Introduction
  • Deep Learning
    • Basics
      • Hinge Loss
      • Regularizations
      • Linear Classification
      • Multi-Class and Cross Entropy Loss
      • Batch Norm and other Normalizations
      • Optimization
      • Optimization Functions
      • Convolution im2col
      • Activation Functions
      • Derivatives
        • Derivatives of Softmax
        • A Smooth (differentiable) Max Function
      • Model Ensemble
      • Layers Python Implementation
    • Classification
      • Mobile friendly networks
      • Non-local Neural Networks
      • Squeeze-and-Excitation Networks
      • Further Attention Utilization -- Efficience & Segmentation
      • Group Norm
      • ShuffleNet V2
    • Segmentation
      • Several Instance Segmentation
      • A Peek at Semantic Segmentation
      • Design Choices for Mobile Friendly Deep Learning Models, Semantic Segmentation
      • Efficient Video Object Segmentation via Network Modulation
      • BiSeNet
      • DeepLabV3+
    • Detection
      • CornerNet
      • IoU-Net
      • Why smooth L1 is popular in BBox Regression
      • MTCNN-NCNN
      • DetNet
      • SSD Illustration
    • RNN Related
      • GRU vs LSTM
      • BERT
    • Reinforcement Learning
      • AutoML in Practice Review
      • DRL for optimal execution of profolio transaction
    • Multi-task
      • Multi-task Overview
      • What are the tricks in Multi-Task network design?
    • Neural Network Interpretation
      • Neuron Visualization
    • Deep Learning Frameworks
      • How does Caffe work
      • [Gluon] When to use (Hybrid)Sequential and (Hybrid)Block
      • Gluon Hybrid Intro
      • Gluon HybridBlocks Walk-Through
      • A quick tour of Torch internals
      • NCHW / NHWC in Pytorch
      • Static & Dynamic Computation Graph
    • Converting Between DL Frameworks
      • Things To Be Considered When Doing Model Converting
      • Caffe to TensorFlow
    • Computation Graph Optimization
      • Two ways of TensorRT to optimize Neural Network Computation Graph
      • Customized Caffe Memory Optimization
      • NCNN Memory Optimization
      • Symbolic Programs Advantages: More Efficient, Reuse Intermediate Memory, Operation Folding
    • Deep Learning Debug
      • Problems caused by dead ReLU
      • Loss jumps to 87.3365
      • Common Causes of NANs During Training
    • Deployment
      • Efficient Convolution Operation
      • Quantization
    • What I read recently
      • Know Google the Paper Way
      • ECCV 2018
      • Neural Machine Translation
      • Street View OCR Extraction System
      • Teaching Machines to Draw
      • Pixel to Graph
      • Burst Image Deblurring
      • Material for Masses
      • Learning to Separate Object Sounds by Watching Unlabeled Video
    • Papers / Posts to be read
    • Dummy thoughts
  • Machine Learning
    • Classification
    • Regression
    • Clustering
    • Dimension Reduction
    • Metrics
    • Regularization
    • Bayesian Example
    • Machine Learning System Design
    • Recommendation
    • Essentials of Machine Learning
    • Linear Regression
    • Logistic Regression
      • Logistic Function
    • Gaussian Discriminant Analysis
    • Naive Bayes
    • SVM
    • MLE vs MAP
    • Boosting
    • Frequent Questions
    • Conclusion of Machine Learning
  • Python notes
    • Python _ or __ underscores usage
    • Python Multiprocess and Threading Differences
    • Heapq vs. Q.PriorityQueue
    • Python decorator
    • Understanding Python super()
    • @ property
    • Python __all__
    • Is Python List a Linked List or Array
    • What is the "u" in u'Hello world'
    • Python "self"
    • Python object and class
    • Python Class' Instance method, Class method, and Static Methods Demystified
    • Python WTF
    • Python find first value index in a list: [list].index(val)
    • Sort tuples, and lambda usecase
    • Reverse order of range()
    • Python check list is empty
    • Python get ASCII value from character
    • An A-Z of useful Python tricks
    • Python nested function variable scope
    • Python reverse a list
    • Python priority queue -- heapq
  • C++ Notes
    • Templates
    • std::string (C++) and char* (or c-string "string" for C)
    • C++ printf and cout
    • Class Member Function
    • Inline
    • Scope Resolution Operator ::
    • Constructor
    • Destructor
    • Garbage Collection is Critical
    • C++ Question Lists
  • Operating System
    • Basics
    • Mutex & Semaphore
    • Ticket Selling System
    • OS and Memory
    • Sort implementation in STL
    • Compile, link, loading & run
    • How to understand Multithreading and Multiprocessing from the view of Operating System
  • Linux & Productivity
    • Jupyter Notebook on Remote Server
    • Nividia-smi monitoring
  • Leetcode Notes
    • Array
      • 11. Container With Most Water
      • 35. Search Insert Position
    • Linked List
      • Difference between Linked List and Array
      • Linked List Insert
      • Design of Linked List
      • Two Pointers
        • 141. Linked List Cycle
        • 142. Linked List Cycle II
        • 160. Intersection of two Linked List
        • 19. Remove N-th node from the end of linked list
      • 206. Reverse Linked List
      • 203. Remove Linked List Elements
      • 328. Odd Even Linked List
      • 234. Palindrome Linked List
      • 21. Merge Two Sorted Lists
      • 430. Flatten a Multilevel Doubly Linked List
      • 430. Flatten a Multilevel Doubly Linked List
      • 708. Insert into a Cyclic Sorted List
      • 138. Copy List with Random Pointer
      • 61. Rotate List
    • Binary Tree
      • 144. Binary Tree Preorder Traversal
      • 94. Binary Tree Iterative In-order Traverse
    • Binary Search Tree
      • 98. Validate Binary Search Tree
      • 285. Inorder Successor in BST
      • 173. Binary Search Tree Iterator
      • 700. Search in a Binary Search Tree
      • 450. Delete Node in a BST
      • 701. Insert into a Binary Search Tree
      • Kth Largest Element in a Stream
      • Lowest Common Ancestor of a BST
      • Contain Duplicate III
      • Balanced BST
      • Convert Sorted Array to Binary Search Tree
    • Dynamic Programming
      • 198. House Robber
      • House Robber II
      • Unique Path
      • Unique Path II
      • Best time to buy and sell
      • Partition equal subset sum
      • Target Sum
      • Burst Ballons
    • DFS
      • Clone Graph
      • General Introduction
      • Array & String
      • Sliding Window
  • Quotes
    • Concert Violinist Joke
    • 船 Ship
    • What I cannot create, I do not understand
    • Set your course by the stars
    • To-do list
Powered by GitBook
On this page
  1. Deep Learning
  2. Basics

Activation Functions

PreviousConvolution im2colNextDerivatives

Last updated 6 years ago

Sigmoid

  1. Have linear region around 0 input, but squashes activation numbers to [0, 1]

  2. But Saturated neurons "kill" the gradients

  3. Sigmoid gradient is not zero centered, then all activations would be positive; in y = W*x + b, W's derivative is x and it means dL/dW would always be same +/- for all parameters. All weights either increase together or decrease together.

4. Why not zero centered is bad? When dL/dW would always be same +/- for all parameter, to reach the hypothetical optimal w, we have to do multiple zig zag updates to arrive there. Because the gradients (the update directions, the red arrow directions) are only allowed in the right-top and left-bottom quadrants. These would be very inefficient gradient updates.

5. So this is also why, in general, we want to have zero mean data. So that we have both positive and negative values and won't get into this all W moving in the same direction problem.

6. exp() is a bit computation expensive.

tanh (pronounced as tan-h, h like h i j k l m n)

  1. Zero centered (pros)

  2. Still kills gradients when saturated (cons)

ReLU (pronounced as re-lu, re like repeat)

  1. f(x) = max(0, x), computation efficient (pros)

  2. Does not saturate in + region (pros)

  3. Converges much faster than sigmoid/tanh in practice (e.g. 6x)

  4. AlexNet in 2012 uses ReLU in their experience and it is the first major CNN doing well on ImageNet and large-scale data. Since then, ReLU has started been used a lot. (pros)

  5. Not zero-centered output (cons)

  6. In negative half, there would be zero gradients; and at the origin, it's not defined. (cons)

  7. In an abstract way, each ReLU neuron is trying to separate the plane into half, and for half of the plane it will be activated. So there could be some ReLUs that wouldn't have any gradients update ( zero gradient) and be dead forever, like the red one. Two reasons for it:

    1. Bad initialization weights. If we have weights that happen to be unlucky and happen to be off the data cloud, like the red one. Then they are never going to have a data input that causes it to activate, so that never good gradient flow coming back -- dead forever.

    2. Learning rate is too high.

ReLU6: ( activation = min(max(features, 0), 6) why 6 here? )

This is useful in making the networks ready for fixed-point inference. If you unbound the upper limit, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 will let them take a max of 3 bits (upto 8) leaving 4/5 bits for .f

It seems, then, that 6 is just an arbitrary value chosen according to the number of bits you want to be able to compress your network's trained parameters into. As per the "why" only the version with value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which probably is the most common use-case.

From :

this reddit thread
Sigmoid Derivative
The gradients (the update directions, the red arrow directions) are only allowed in the right-top and left-bottom quadrants.