
A Peek at Semantic Segmentation


Traditional Methods

Graph Cut

Grab Cut

An interactive optimization version of Graph Cut.

CRF

Basic Components for Deep Learning Methods

In brief, segmentation can be seen as a pixel-wise classification task. Since the prediction is made at the pixel level, a large and representative final feature map is needed.
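As a quick illustration (a minimal sketch assuming PyTorch; the tensor sizes are made up), "pixel-wise classification" just means the network emits one class-score vector per pixel and a standard cross-entropy loss is applied at every pixel independently:

import torch
import torch.nn.functional as F

N, C, H, W = 2, 21, 64, 64                 # batch, classes, height, width (illustrative sizes)
logits = torch.randn(N, C, H, W)           # final feature map: a C-dim score vector per pixel
labels = torch.randint(0, C, (N, H, W))    # ground-truth class index per pixel

loss = F.cross_entropy(logits, labels)     # averaged over all N * H * W pixel classifications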

Two Choices for Backbone

A good final feature map should combine three elements: large spatial size, plenty of high-level abstract information (go deep), and rich low-level details (skip connections from small-stride layers).

There are two choices to meet these requirements:

  • Reuse and modify a pre-trained SOTA classification network.

  • Design a segmentation-specific backbone and train it on ImageNet from scratch (check the link for further analysis on this topic).

As for reusing existing classification models, we need to enlarge the backbone's late-stage feature maps, since most networks end with feature maps at 1/32 or 1/64 of the input resolution, which is far too small for segmentation.

Enlarge Feature Map

There are two ways to make it large:

  • Transposed Convolution / Bilinear Interpolation to upsample the small feature map.

  • Dilated Convolution to keep the large early-stage feature map size and never downsample after the target stride, e.g. keeping stride 8 in DRN / PSPNet / DeepLab V3 or stride 16 in DeepLab V3+.

Transposed Convolution (Deconv):

Upsampling via Interpolation:

Transposed Convolution as Bilinear Interpolation:

With carefully chosen initial weights, Deconv can work as Bilinear Interpolation. Below is the MXNet Bilinear weight filler for Deconv.

import numpy as np
from mxnet.initializer import Initializer

class Bilinear(Initializer):
    """Initialize Deconv (transposed conv) weights to perform bilinear upsampling."""
    def __init__(self):
        super(Bilinear, self).__init__()

    def _init_weight(self, _, arr):
        # arr is the 4-D Deconv weight; the bilinear kernel is built per (kernel_h, kernel_w).
        weight = np.zeros(np.prod(arr.shape), dtype='float32')
        shape = arr.shape
        f = np.ceil(shape[3] / 2.)
        c = (2 * f - 1 - f % 2) / (2. * f)
        for i in range(np.prod(shape)):
            x = i % shape[3]                 # column index inside the kernel
            y = (i // shape[3]) % shape[2]   # row index inside the kernel (integer division)
            weight[i] = (1 - abs(x / f - c)) * (1 - abs(y / f - c))
        arr[:] = weight.reshape(shape)
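As a usage sketch (assuming the Gluon API; the layer hyper-parameters below are illustrative), the filler can be passed as the weight initializer of a grouped, stride-2 transposed convolution so that each channel starts out being upsampled by exact 2x bilinear interpolation:

from mxnet.gluon import nn

# Hypothetical 2x upsampling layer: kernel 4, stride 2, padding 1 doubles the spatial size;
# groups == channels gives every channel its own 4 x 4 bilinear kernel.
upsample = nn.Conv2DTranspose(channels=21, in_channels=21, kernel_size=4, strides=2,
                              padding=1, groups=21, use_bias=False,
                              weight_initializer=Bilinear())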

Dilated Convolution (Atrous Convolution):

Stop downsampling after the layer where the target stride St is reached (e.g. St = 8) in the middle of the backbone, and use dilation rates 2, 4, ... on the following layers that originally ran at stride St * 2 (16), St * 4 (32), ... so that the pre-trained backbone parameters keep their original receptive fields.

For the layers that now run at stride 16 but had stride 32 in the original backbone, a 5 x 5 receptive field (in the new, denser feature map) is needed to keep the effective receptive field unchanged. A 3 x 3 filter with dilation rate 2 covers exactly such a 5 x 5 area.
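A minimal PyTorch sketch (layer sizes are illustrative) of this stride-to-dilation swap: the stride-2 3 x 3 convolution is replaced by a stride-1, dilation-2 3 x 3 convolution with padding 2, so the feature map keeps its current resolution while each weight still covers the same 5 x 5 neighborhood:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)    # a stride-8 feature map (hypothetical shape)

# Original backbone layer: downsamples to stride 16.
conv_strided = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)

# Dilated replacement: stride 1 and dilation 2 keep the output at stride 8
# while the 3 x 3 kernel still spans a 5 x 5 area.
conv_dilated = nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=2, dilation=2)

print(conv_strided(x).shape)   # torch.Size([1, 512, 32, 32])
print(conv_dilated(x).shape)   # torch.Size([1, 512, 64, 64])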

Affinity Field

An affinity matrix is a generic matrix that determines how close, or similar, two points are in a space. In computer vision tasks, it is a weighted graph that regards each pixel as a node, and connects each pair of pixels by an edge. The weight on that edge should reflect the pairwise similarity with respect to different tasks.

Some papers propose to estimate / learn the affinity field and use it to improve the boundary performance.
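As a small generic illustration (not any specific paper's formulation; the sizes and the Gaussian kernel are assumptions), a feature-based affinity between pixels i and j can be defined as w_ij = exp(-||f_i - f_j||^2 / (2 * sigma^2)), computed densely for a toy feature map:

import torch

C, H, W = 8, 16, 16                       # toy feature map size
feat = torch.randn(C, H, W)
f = feat.reshape(C, H * W).t()            # (H*W, C): one feature vector per pixel

sigma = 1.0
dist2 = torch.cdist(f, f).pow(2)          # squared pairwise distances, shape (H*W, H*W)
affinity = torch.exp(-dist2 / (2 * sigma ** 2))   # affinity matrix; similar pixels -> close to 1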

Attention Module

Some recent segmentation papers use attention modules to improve performance. There are two common design choices: channel attention and spatial attention. For more discussion of their efficiency, please check the linked note.

Channel Attention

The Squeeze-and-Excitation method uses fully connected layers to estimate / calculate the channel-wise attention weights.

Spatial Attention

The Non-local method exploits self-attention (similar to a covariance matrix) to enhance and suppress targeted spatial information.
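A minimal PyTorch sketch of the channel-attention (Squeeze-and-Excitation style) idea above; the reduction ratio and tensor sizes are illustrative:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global average pool -> FC -> FC -> sigmoid gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: global average pooling -> (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # excitation: per-channel weights in [0, 1]
        return x * w                                 # re-weight each channel

x = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(x).shape)                 # torch.Size([2, 64, 32, 32])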

Network Evolutions

FCN

  • It upsamples only once, i.e. it has only one layer in the decoder.

  • The original implementation uses bilinear interpolation to upsample the convolved feature map, i.e. there is no learnable upsampling filter.

  • Variants of FCN (FCN-16s and FCN-8s) add skip connections from lower layers to make the output robust to scale changes.

UNet

  • Multiple upsampling layers

  • Uses skip connections with concatenation instead of element-wise addition (see the sketch after this list)

  • Uses learnable weight filters instead of a fixed interpolation technique
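A tiny sketch of the two skip-connection styles (shapes are illustrative): FCN-style addition requires matching channel counts, while UNet-style concatenation stacks channels and lets the following convolution learn how to mix them:

import torch

decoder_feat = torch.randn(1, 64, 128, 128)    # upsampled high-level features
encoder_feat = torch.randn(1, 64, 128, 128)    # skip from an early, low-stride layer

fcn_style  = decoder_feat + encoder_feat                   # add:    (1, 64, 128, 128)
unet_style = torch.cat([decoder_feat, encoder_feat], 1)    # concat: (1, 128, 128, 128)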

DRN

Same strategy as the dilated-convolution recipe above: stop downsampling once the target stride (e.g. St = 8) is reached and use increasing dilation rates (2, 4, ...) on the later layers, so the pre-trained backbone parameters keep their original receptive fields.

PSPNet

  • DRN as backbone

  • Auxiliary loss between the end of backbone and SPP

  • Synchronized BN for training with small batch size

  • Dropout before classifier

  • Hard to reproduce; rumor has it that the ResNet-101 backbone was pre-trained on a SenseTime internal classification dataset.

DeepLab Series

DeepLab 2014

  • Dilated convolution in VGG16

  • CRF as post-processing

DeepLab V2 2016

  • DRN backbone

  • ASPP for multi-scale context fusion

  • Dense CRF as post-processing

  • Stride 8 backbone output

DeepLab V3 2017

  • Revisited ASPP (a minimal sketch follows this list)

    • Abandon the branch with dilation rate 24, since a kernel with a very large dilation rate eventually degrades to a 1 x 1 filter (most of its weights fall onto padding).

    • Add a global average pooling branch with bilinear upsampling, just like the SPP 1 x 1 branch in PSPNet.

  • Train with stride 16 final output first, then freeze all BN layers and continue training with stride 8 final output -- an alternative to synchronized BN.

  • After training on the augmented VOC set, fine-tune on the official VOC 2012 trainval set only.

  • Duplicate images of the hard classes within an epoch.

  • No CRF

  • Using a backbone pre-trained on a Google internal classification dataset boosts segmentation performance.
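A minimal PyTorch sketch of the revisited ASPP head (the dilation rates and channel sizes are illustrative, and BN / ReLU are omitted): a 1 x 1 branch, three dilated 3 x 3 branches, and a global-average-pooling branch that is bilinearly upsampled back, all concatenated and fused by a 1 x 1 convolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.dilated = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.image_pool = nn.Conv2d(in_ch, out_ch, 1)    # applied after global average pooling
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        branches = [self.conv1x1(x)] + [conv(x) for conv in self.dilated]
        pooled = self.image_pool(x.mean(dim=(2, 3), keepdim=True))   # global context, (N, C, 1, 1)
        branches.append(F.interpolate(pooled, size=(h, w), mode='bilinear', align_corners=False))
        return self.project(torch.cat(branches, dim=1))

y = ASPP()(torch.randn(1, 2048, 33, 33))
print(y.shape)    # torch.Size([1, 256, 33, 33])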

DeepLab V3+ 2018

  • An efficient combination of SPP series and Encoder-decoder

  • ASPP is the same as in V3

  • Dilated Xception backbone with stride 16 final output. Xception itself is a high-performance and efficient backbone, and the stride 16 version costs roughly 4x less computation than the stride 8 version, so the backbone (encoder) part is quite fast compared to its predecessors.

  • Uses bilinear interpolation as the internal upsampling method

  • Low-level feature map channels are reduced to a small number (48) and then concatenated with the upsampled high-level feature map.

  • Training tricks are the same as in DeepLab V3

  • SOTA performance

With Attention

Dual Attention Network

  • Feature-map dot-product attention for both channel-wise and spatial-wise dimensions

  • The visualization of the two attention maps is quite interesting; check the authors' github repo for further pictures.

BiSeNet

A fast segmentation architecture built on Xception-39, a very shallow spatial-branch sub-net, and channel-wise attention.

  • The main idea has two parts: a thin and deep branch for extracting context information, and a wide and shallow branch for extracting spatial information.

  • ARM and FFM here are both channel-wise attention modules.

  • FM + FM * Attention (the skip version used in FFM) outperforms the product-only version (used in ARM). But because the element-wise addition costs extra computation, FFM is only used at the most critical place -- feature fusion. (A small sketch of the two variants follows.)
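A tiny sketch of the two re-weighting variants (tensor shapes are illustrative); attn stands for the per-channel gate in [0, 1] produced by a channel-attention module:

import torch

fm = torch.randn(1, 128, 28, 28)     # feature map
attn = torch.rand(1, 128, 1, 1)      # channel-wise attention weights in [0, 1]

arm_style = fm * attn                # product only (ARM)
ffm_style = fm + fm * attn           # product plus skip connection (FFM)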

With Affinity Field

SPN

  • Use propagation idea to accumulate affinity field

  • The learning of a large affinity matrix for diffusion is converted into learning a local linear spatial propagation, yielding a simple yet effective approach for output enhancement.

Adaptive Affinity Fields

  • Treat affinity field optimization as a separate loss for segmentation

CSPN for Depth Estimation

  • Replace three way propagation with convolution

  • Uses CSPN to aggregate sparse LiDAR depth information and refine the final depth-estimation results.

Counter Specific Problems

Deconv -- Check-board pattern

In order to enlarge the feature map, we often use Deconv, which lets the model use every point in the small input feature map to "paint" a square in the larger output. However, if the Deconv kernel size is not divisible by its stride, it easily produces "uneven overlap" and the well-known checkerboard pattern in the output.

In theory, a model could learn a weight distribution that compensates for the uneven positions and balances the output, but in practice neural networks struggle to achieve that.

In fact, the real situation is far from optimal and even worse than expected: not only do models with uneven overlap fail to learn to avoid this, but models with even overlap often learn kernels that cause similar artifacts.

An alternative way to eliminate the checkerboard pattern is to first interpolate (resize) the small input feature map and then apply a regular convolution.
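A minimal PyTorch sketch of this resize-then-convolve alternative (channel counts are illustrative), next to a stride-2 transposed convolution that yields the same output size:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 32, 32)

# Deconv upsampling: kernel 3 with stride 2 -> uneven overlap, prone to checkerboard artifacts.
deconv = nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1, output_padding=1)

# Resize-convolution: bilinear upsampling followed by a regular convolution.
conv = nn.Conv2d(256, 128, kernel_size=3, padding=1)
resize_conv = conv(F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False))

print(deconv(x).shape, resize_conv.shape)    # both torch.Size([1, 128, 64, 64])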

Dilated Convolution -- Grid Pattern

  • When stacking several blocks that all use dilation rate 2, only a sparse grid of input pixels contributes to each output pixel.

  • The paper claims that using HDC (Hybrid Dilated Convolution), i.e. cycling through different dilation rates instead of repeating the same one, eliminates the gridding pattern. (A small sketch follows this list.)
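A small sketch (assuming PyTorch; the rates and channel count are illustrative) contrasting a stack of identical dilation-2 convolutions with an HDC-style stack that cycles through rates such as 1, 2, 5, so consecutive layers fill in each other's sampling gaps:

import torch.nn as nn

def dilated_stack(rates, channels=256):
    # Stack of 3 x 3 convolutions whose dilation (and matching padding) follow the given rates.
    return nn.Sequential(*[
        nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
        for r in rates
    ])

gridded = dilated_stack([2, 2, 2])    # identical rates -> sparse, gridded sampling of the input
hdc     = dilated_stack([1, 2, 5])    # HDC-style mixed rates -> denser coverage of the input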

References

  • http://vision.stanford.edu/teaching/cs231b_spring1415/slides/lecture2_segmentation.pdf

  • A 2017 Guide to Semantic Segmentation with Deep Learning (qure.ai)

  • Deeplab Image Semantic Segmentation Network (Thalles' blog)

  • Deconvolution and Checkerboard Artifacts (Distill)
[Figures: Stride 16 version DeepLab V3; (a) DeepLab V3 / PSPNet, (b) UNet, (c) DeepLab V3+; Detailed DeepLab V3+; checkerboard pattern examples (kernel size 3 vs 4 with stride 2, optimal kernel, even overlap still showing checkerboard); grid pattern.]