The Truth of Sisyphus
  • Introduction
  • Deep Learning
    • Basics
      • Hinge Loss
      • Regularizations
      • Linear Classification
      • Multi-Class and Cross Entropy Loss
      • Batch Norm and other Normalizations
      • Optimization
      • Optimization Functions
      • Convolution im2col
      • Activation Functions
      • Derivatives
        • Derivatives of Softmax
        • A Smooth (differentiable) Max Function
      • Model Ensemble
      • Layers Python Implementation
    • Classification
      • Mobile friendly networks
      • Non-local Neural Networks
      • Squeeze-and-Excitation Networks
      • Further Attention Utilization -- Efficience & Segmentation
      • Group Norm
      • ShuffleNet V2
    • Segmentation
      • Several Instance Segmentation
      • A Peek at Semantic Segmentation
      • Design Choices for Mobile Friendly Deep Learning Models, Semantic Segmentation
      • Efficient Video Object Segmentation via Network Modulation
      • BiSeNet
      • DeepLabV3+
    • Detection
      • CornerNet
      • IoU-Net
      • Why smooth L1 is popular in BBox Regression
      • MTCNN-NCNN
      • DetNet
      • SSD Illustration
    • RNN Related
      • GRU vs LSTM
      • BERT
    • Reinforcement Learning
      • AutoML in Practice Review
      • DRL for optimal execution of profolio transaction
    • Multi-task
      • Multi-task Overview
      • What are the tricks in Multi-Task network design?
    • Neural Network Interpretation
      • Neuron Visualization
    • Deep Learning Frameworks
      • How does Caffe work
      • [Gluon] When to use (Hybrid)Sequential and (Hybrid)Block
      • Gluon Hybrid Intro
      • Gluon HybridBlocks Walk-Through
      • A quick tour of Torch internals
      • NCHW / NHWC in Pytorch
      • Static & Dynamic Computation Graph
    • Converting Between DL Frameworks
      • Things To Be Considered When Doing Model Converting
      • Caffe to TensorFlow
    • Computation Graph Optimization
      • Two ways of TensorRT to optimize Neural Network Computation Graph
      • Customized Caffe Memory Optimization
      • NCNN Memory Optimization
      • Symbolic Programs Advantages: More Efficient, Reuse Intermediate Memory, Operation Folding
    • Deep Learning Debug
      • Problems caused by dead ReLU
      • Loss jumps to 87.3365
      • Common Causes of NANs During Training
    • Deployment
      • Efficient Convolution Operation
      • Quantization
    • What I read recently
      • Know Google the Paper Way
      • ECCV 2018
      • Neural Machine Translation
      • Street View OCR Extraction System
      • Teaching Machines to Draw
      • Pixel to Graph
      • Burst Image Deblurring
      • Material for Masses
      • Learning to Separate Object Sounds by Watching Unlabeled Video
    • Papers / Posts to be read
    • Dummy thoughts
  • Machine Learning
    • Classification
    • Regression
    • Clustering
    • Dimension Reduction
    • Metrics
    • Regularization
    • Bayesian Example
    • Machine Learning System Design
    • Recommendation
    • Essentials of Machine Learning
    • Linear Regression
    • Logistic Regression
      • Logistic Function
    • Gaussian Discriminant Analysis
    • Naive Bayes
    • SVM
    • MLE vs MAP
    • Boosting
    • Frequent Questions
    • Conclusion of Machine Learning
  • Python notes
    • Python _ or __ underscores usage
    • Python Multiprocess and Threading Differences
    • Heapq vs. Q.PriorityQueue
    • Python decorator
    • Understanding Python super()
    • @ property
    • Python __all__
    • Is Python List a Linked List or Array
    • What is the "u" in u'Hello world'
    • Python "self"
    • Python object and class
    • Python Class' Instance method, Class method, and Static Methods Demystified
    • Python WTF
    • Python find first value index in a list: [list].index(val)
    • Sort tuples, and lambda usecase
    • Reverse order of range()
    • Python check list is empty
    • Python get ASCII value from character
    • An A-Z of useful Python tricks
    • Python nested function variable scope
    • Python reverse a list
    • Python priority queue -- heapq
  • C++ Notes
    • Templates
    • std::string (C++) and char* (or c-string "string" for C)
    • C++ printf and cout
    • Class Member Function
    • Inline
    • Scope Resolution Operator ::
    • Constructor
    • Destructor
    • Garbage Collection is Critical
    • C++ Question Lists
  • Operating System
    • Basics
    • Mutex & Semaphore
    • Ticket Selling System
    • OS and Memory
    • Sort implementation in STL
    • Compile, link, loading & run
    • How to understand Multithreading and Multiprocessing from the view of Operating System
  • Linux & Productivity
    • Jupyter Notebook on Remote Server
    • Nividia-smi monitoring
  • Leetcode Notes
    • Array
      • 11. Container With Most Water
      • 35. Search Insert Position
    • Linked List
      • Difference between Linked List and Array
      • Linked List Insert
      • Design of Linked List
      • Two Pointers
        • 141. Linked List Cycle
        • 142. Linked List Cycle II
        • 160. Intersection of two Linked List
        • 19. Remove N-th node from the end of linked list
      • 206. Reverse Linked List
      • 203. Remove Linked List Elements
      • 328. Odd Even Linked List
      • 234. Palindrome Linked List
      • 21. Merge Two Sorted Lists
      • 430. Flatten a Multilevel Doubly Linked List
      • 430. Flatten a Multilevel Doubly Linked List
      • 708. Insert into a Cyclic Sorted List
      • 138. Copy List with Random Pointer
      • 61. Rotate List
    • Binary Tree
      • 144. Binary Tree Preorder Traversal
      • 94. Binary Tree Iterative In-order Traverse
    • Binary Search Tree
      • 98. Validate Binary Search Tree
      • 285. Inorder Successor in BST
      • 173. Binary Search Tree Iterator
      • 700. Search in a Binary Search Tree
      • 450. Delete Node in a BST
      • 701. Insert into a Binary Search Tree
      • Kth Largest Element in a Stream
      • Lowest Common Ancestor of a BST
      • Contain Duplicate III
      • Balanced BST
      • Convert Sorted Array to Binary Search Tree
    • Dynamic Programming
      • 198. House Robber
      • House Robber II
      • Unique Path
      • Unique Path II
      • Best time to buy and sell
      • Partition equal subset sum
      • Target Sum
      • Burst Ballons
    • DFS
      • Clone Graph
      • General Introduction
      • Array & String
      • Sliding Window
  • Quotes
    • Concert Violinist Joke
    • 船 Ship
    • What I cannot create, I do not understand
    • Set your course by the stars
    • To-do list
Powered by GitBook
On this page
  1. Deep Learning
  2. Multi-task

What are the tricks in Multi-Task network design?

作者:Alan Huang 链接:https://www.zhihu.com/question/268105631/answer/335246543 来源:知乎

对于 gradient balancing问题,刘诗昆同学回答得挺不错。 我这边再额外补充一些。

multi-task learning 中, tasks之间彼此的相容性对结果也会有一些影响。当两个任务矛盾的时候, 往往结果会比单任务还要差不少。

Multi-task learning 还需要解决的是Gradient domination的问题。 这个问题产生的原因是不同任务的loss的梯度相差过大, 导致梯度小的loss在训练过程中被梯度大的loss所带走。 题主所说的问题1和2都是指这个问题。

如果一开始就给不同的Loss进行加权, 让它们有相近的梯度, 是不是就能训练的好呢? 结果往往不是这样的。不同的loss, 他们的梯度在训练过程中变化情况也是不一样的;而且不同的loss, 在梯度值相同的时候, 它们在task上的表现也是不同的。在训练开始的时候,虽然balance了, 但是随着训练过程的进行, 中间又发生gradient domination了。 所以要想解决这个问题, 还是要合适地对不同loss做合适的均衡。

实践中应该要如何调整呢? 其实很简单:

假设我们有两个task, 用A和B表示。 假设网络设计足够好, 容量足够大, 而且两个任务本身具有相关性,能够训得足够好。

如果A和B单独训练, 他们在收敛的时候的梯度大小分别记为 Grad_a, Grad_b, 那么我们只需要在两个任务一起训练的时候, 分别用各自梯度的倒数(1/Grad_a, 1/Grad_b)对两个任务做平衡, 然后统一乘一个scalar就可以了。(根据单任务的收敛时候的loss梯度去确定multi-task训练中不同任务的权重。)

因为loss的梯度在训练中通常会变小(这里用通常是因为一般mean square error等loss是这样, 其他有的Loss并不是。), 如果我们确定这个网络在multi-task训练的时候能够达到原来的效果, 我们就只需要把平衡点设在两个任务都足够好的时候。 这样网络在训练过程中, 就自然能够达到那个平衡点, 即使一开始的时候会有gradient domination出现。

PreviousMulti-task OverviewNextNeural Network Interpretation

Last updated 6 years ago