A Peek at Semantic Segmentation

Traditional Methods

Graph Cut

Grab Cut

Interactive optimization version of Graph Cut.


Basic Components for Deep Learning Methods

In brief, segmentation can be seen as a pixel-wise classification task. Because the prediction is made at the pixel level, the final feature map needs to be large and representative.
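To make the pixel-wise classification view concrete, here is a minimal NumPy sketch. The function name `predict_label_map` and the toy scores are made up for illustration: the network is assumed to output one score per class for every pixel, and the label map is simply the arg-max over the class axis.

```python
import numpy as np

def predict_label_map(logits):
    """Turn per-pixel class scores of shape (C, H, W) into a label map (H, W)."""
    return np.argmax(logits, axis=0)

# Hypothetical scores for 3 classes on a 2 x 2 image.
logits = np.array([
    [[2.0, 0.1], [0.3, 0.2]],   # class 0
    [[0.5, 1.5], [0.1, 0.4]],   # class 1
    [[0.1, 0.2], [0.9, 0.8]],   # class 2
])
label_map = predict_label_map(logits)
print(label_map)
```

The output resolution equals the resolution of the score map, which is why a large final feature map matters.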

Two Choices for Backbone

A good final feature map combines three properties: large spatial size, plenty of high-level abstract information (from going deep) and rich low-level details (from skip connections to small-stride layers).

There are two choices to ensure the requirements:

When reusing existing classification models, we need to enlarge the late-stage feature maps of the backbone, as most networks end with feature maps (FMs) at 1/32 or 1/64 of the input size, which are far too small for segmentation.

Enlarge Feature Map

There are two ways to make it large:

  • Use Transposed Convolution or Bilinear Interpolation to upsample a small feature map.

  • Use Dilated Convolution to keep the large FM of an early stage and never downsample again once the target stride is reached, like keeping stride 8 in DRN / PSPNet / DeepLab V3 or stride 16 in DeepLab V3+.

Transposed Convolution (Deconv):

Upsampling via Interpolation:

Transposed Convolution as Bilinear Interpolation:

With carefully chosen initial weights, Deconv can work as Bilinear Interpolation. Below is the MXNet Bilinear initializer for Deconv.

import numpy as np
from mxnet.initializer import Initializer


class Bilinear(Initializer):
    """Initialize weight for upsampling layers."""

    def __init__(self):
        super(Bilinear, self).__init__()

    def _init_weight(self, _, arr):
        weight = np.zeros(np.prod(arr.shape), dtype='float32')
        shape = arr.shape  # shape[2] and shape[3] are kernel height and width
        f = np.ceil(shape[3] / 2.)          # upsampling factor implied by the kernel width
        c = (2 * f - 1 - f % 2) / (2. * f)  # centre of the bilinear kernel
        for i in range(np.prod(shape)):
            x = i % shape[3]
            y = (i // shape[3]) % shape[2]  # integer division (Python 3)
            weight[i] = (1 - abs(x / f - c)) * (1 - abs(y / f - c))
        arr[:] = weight.reshape(shape)
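To see what these weights look like, the following sketch reuses the same `f` and `c` formula in plain NumPy (no MXNet needed) and applies the resulting 1-D kernel in a hand-written transposed convolution. The helper names are hypothetical and the 1-D setting is a simplification for illustration.

```python
import numpy as np

def bilinear_kernel_1d(size):
    """1-D bilinear kernel, using the same formula as the initializer."""
    f = np.ceil(size / 2.0)
    c = (2 * f - 1 - f % 2) / (2.0 * f)
    x = np.arange(size)
    return 1 - np.abs(x / f - c)

def transposed_conv_1d(signal, kernel, stride):
    """Every input value 'paints' a scaled copy of the kernel into the output."""
    out = np.zeros((len(signal) - 1) * stride + len(kernel))
    for i, v in enumerate(signal):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out

kernel = bilinear_kernel_1d(4)
upsampled = transposed_conv_1d(np.array([0.0, 4.0, 8.0]), kernel, stride=2)
print(kernel)      # [0.25 0.75 0.75 0.25]
print(upsampled)   # [0. 0. 1. 3. 5. 7. 6. 2.]
```

Away from the borders the output (1, 3, 5, 7) is a straight ramp between the input samples, which is exactly what linear interpolation would give. The 2-D kernel is the outer product of two such 1-D kernels.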

Dilated Convolution (Atrous Convolution):

Stop downsampling after the last layer at the target stride (St, e.g. St = 8) in the middle of the backbone network. The following layers, which originally ran at stride St * 2 (stride 16), St * 4 (stride 32), ..., then use dilation rates 2, 4, ... so that the pre-trained backbone parameters keep their original receptive fields.

Take a layer that now runs at stride 16 but ran at stride 32 in the original backbone. Its feature map is twice as large in each dimension, so a 3 x 3 filter has to span a 5 x 5 area to keep its receptive field unchanged. A 3 x 3 filter with dilation rate 2 covers an area equivalent to exactly such a 5 x 5 filter.
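The 3 x 3 to 5 x 5 relation follows from the general rule that a kernel of size k with dilation rate d spans k + (k - 1)(d - 1) positions. A small sketch, with hypothetical helper names and assuming a square kernel:

```python
import numpy as np

def effective_kernel_size(kernel_size, dilation):
    """Span of a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def dilated_kernel(kernel, dilation):
    """Equivalent dense kernel: (dilation - 1) zeros between neighbouring taps."""
    size = effective_kernel_size(kernel.shape[0], dilation)
    out = np.zeros((size, size), dtype=kernel.dtype)
    out[::dilation, ::dilation] = kernel
    return out

dense = dilated_kernel(np.ones((3, 3)), dilation=2)
print(effective_kernel_size(3, 2))   # 5
print(dense)
```

The dense 5 x 5 kernel still has only 9 non-zero weights, so the pre-trained 3 x 3 parameters can be reused as they are.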

Affinity Field

An affinity matrix is a generic matrix that determines how close, or similar, two points are in a space. In computer vision tasks, it is a weighted graph that regards each pixel as a node, and connects each pair of pixels by an edge. The weight on that edge should reflect the pairwise similarity with respect to different tasks.
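As a toy example of such a matrix, the sketch below builds a dense affinity matrix for a tiny grayscale image. The Gaussian kernel on intensity differences and the `sigma` parameter are assumptions chosen for illustration. The right similarity measure depends on the task, and the papers in this area learn it rather than fix it by hand.

```python
import numpy as np

def affinity_matrix(image, sigma=1.0):
    """Dense Gaussian affinity between all pixel pairs of a grayscale image.

    Returns an (N, N) matrix for an image with N pixels. Entry (i, j) is the
    edge weight between pixel i and pixel j.
    """
    values = image.reshape(-1).astype(np.float64)
    diff = values[:, None] - values[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))

image = np.array([[0.0, 0.1],
                  [1.0, 0.9]])
W = affinity_matrix(image, sigma=0.5)
print(np.round(W, 3))
```

Pixels 0 and 1 (both dark) get a weight close to 1, while pixel 0 and pixel 2 (dark vs. bright) get a much smaller one.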

Some papers propose to estimate / learn the affinity field and use it to improve the boundary performance.

Attention Module

Some recent segmentation papers use attention modules to improve performance. There are two main design choices: channel attention and spatial attention.

Channel Attention

The Squeeze-and-Excitation method uses fully connected layers to estimate the channel-wise attention weights.
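A minimal NumPy sketch of this idea: global average pooling as the "squeeze", followed by two fully connected layers as the "excitation". Biases are left out, and the weight shapes and the reduction to 2 hidden units are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excitation(feature_map, w1, w2):
    """feature_map: (C, H, W); w1: (C_hidden, C); w2: (C, C_hidden)."""
    squeezed = feature_map.mean(axis=(1, 2))     # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)      # first FC layer + ReLU
    weights = sigmoid(w2 @ hidden)               # second FC layer + sigmoid, in (0, 1)
    return feature_map * weights[:, None, None], weights

rng = np.random.default_rng(0)
fm = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
out, att = squeeze_excitation(fm, w1, w2)
print(att.shape, out.shape)   # (8,) (8, 4, 4)
```

Each channel is rescaled by a single scalar, so the cost of the attention itself is independent of the spatial size.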

Spatial Attention

The Non-local method exploits self-attention (similar to a covariance matrix between spatial positions) to enhance or suppress targeted spatial information.
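The core computation can be sketched in a few lines of NumPy. This is a stripped-down version under stated assumptions: the learned projections and the residual connection of the full block are omitted, and a softmax is used to normalize the pairwise similarities.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feature_map):
    """feature_map: (C, H, W). Returns the attended map and the (N, N) attention."""
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, h * w).T        # (N, C): one row per spatial position
    attention = softmax(x @ x.T, axis=-1)      # similarity of every position pair
    out = attention @ x                        # each position = weighted sum of all
    return out.T.reshape(c, h, w), attention

rng = np.random.default_rng(0)
fm = rng.standard_normal((4, 3, 3))
out, attention = spatial_self_attention(fm)
print(out.shape, attention.shape)   # (4, 3, 3) (9, 9)
```

The attention matrix has N x N entries for N = H * W positions, which is why spatial attention gets expensive on large feature maps.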

Network Evolutions


FCN

  • It upsamples only once, i.e. it has only one layer in the decoder.

  • The original implementation (GitHub repo) uses bilinear interpolation to upsample the convolved image, i.e. there is no learnable filter here.

  • The FCN variants (FCN-16s and FCN-8s) add skip connections from lower layers to make the output robust to scale changes.


U-Net

  • Multiple upsampling layers

  • Uses skip connections and concatenates instead of adding up

  • Uses learnable weight filters instead of fixed interpolation technique
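The difference between adding a skip connection and concatenating it is easiest to see in the shapes. A tiny sketch with made-up channel counts and helper names:

```python
import numpy as np

def fuse_by_addition(upsampled, skip):
    """Element-wise sum: channel count stays the same, shapes must match."""
    return upsampled + skip

def fuse_by_concatenation(upsampled, skip):
    """Stack along the channel axis: the next conv sees both sources separately."""
    return np.concatenate([upsampled, skip], axis=0)

upsampled = np.ones((16, 32, 32))   # hypothetical upsampled decoder feature
skip = np.zeros((16, 32, 32))       # hypothetical encoder feature at the same size
added = fuse_by_addition(upsampled, skip)
concatenated = fuse_by_concatenation(upsampled, skip)
print(added.shape, concatenated.shape)   # (16, 32, 32) (32, 32, 32)
```

Concatenation doubles the channels that the following convolution has to process, but lets that convolution learn how to mix the two sources.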


DRN

Stop downsampling after the last layer at the target stride (St, e.g. St = 8) in the middle of the backbone network, and use dilation rates 2, 4, ... in the layers that originally ran at stride St * 2 (stride 16), St * 4 (stride 32), ..., so that the pre-trained backbone parameters keep their original receptive fields.


PSPNet

  • DRN as backbone

  • Auxiliary loss between the end of backbone and SPP

  • Synchronized BN for training with small batch size

  • Dropout before classifier

  • Hard to reimplement; it is rumored that the ResNet-101 backbone was pre-trained on a SenseTime internal classification dataset.

DeepLab Series

DeepLab 2014

  • Dilated convolution in VGG16

  • CRF as post-processing

DeepLab V2 2016

  • DRN backbone

  • ASPP for multi-scale context fusion

  • Dense CRF as post-processing

  • Stride 8 backbone output

DeepLab V3 2017

Stride 16 version DeepLab V3
  • Revisited ASPP

    • Drop the branch with dilation rate 24, since a kernel with a very large dilation rate eventually degenerates into a 1 x 1 filter: only its centre weight still falls on valid feature map positions.

    • Add global average pooling branch with bilinear interpolation upsampling, just like the SPP 1 x 1 branch in PSPNet.

  • Train with a stride 16 final output first, then freeze all BN layers and continue training with a stride 8 final output. This is an alternative to synchronized BN.

  • After training on the augmented VOC set, fine-tune on the official VOC 2012 trainval set only.

  • Double the number of hard-class images in each epoch

  • No CRF

  • Using a backbone pre-trained on a Google internal classification dataset will boost the segmentation performance.
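The remark about large dilation rates degrading to a 1 x 1 filter can be checked by counting how many taps of a dilated 3 x 3 kernel land inside the feature map. The sketch below assumes zero padding and a 65 x 65 feature map. Both are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def valid_tap_fractions(feature_size, rate):
    """Fraction of output positions where all 9 taps, or only the centre tap,
    of a 3 x 3 kernel with the given dilation rate fall inside the map."""
    pos = np.arange(feature_size)
    # In-bounds taps along one axis for each position (1, 2 or 3).
    valid_1d = sum(((pos + d >= 0) & (pos + d < feature_size)).astype(int)
                   for d in (-rate, 0, rate))
    valid_2d = np.outer(valid_1d, valid_1d)   # the two axes are independent
    return {
        "all_nine": float((valid_2d == 9).mean()),
        "centre_only": float((valid_2d == 1).mean()),
    }

for rate in (6, 24, 64, 65):
    print(rate, valid_tap_fractions(65, rate))
```

As the rate approaches the feature map size, fewer and fewer positions see all nine weights, and at rate 65 every position sees only the centre weight, i.e. the kernel behaves like a 1 x 1 filter.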

DeepLab V3+ 2018

(a) DeepLab V3 / PSPNet (b) UNet (c) DeepLab V3+
Detailed DeepLab V3+
  • An efficient combination of SPP series and Encoder-decoder

  • ASPP is the same as in DeepLab V3

  • Dilated Xception backbone with a stride 16 final backbone output. Xception itself is a high-performance and efficient backbone. The computational cost at stride 16 is also about 4 times lower than at stride 8, so the backbone (encoder) part is quite fast compared to its predecessors.

  • Use bilinear interpolation as the internal upsampling method

  • Low-level FM channels are reduced to a small number (48) and then concatenated after the upsampled high-level FM.

  • Training tricks are the same as in DeepLab V3

  • SOTA performance
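The decoder fusion step described in the list (reduce the low-level channels to 48, upsample the high-level FM bilinearly, concatenate) can be sketched in NumPy as follows. The channel counts other than 48, the spatial sizes, the upsampling factor of 4 and the align-corners style of the interpolation are assumptions made for illustration, and random weights stand in for learned ones.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def bilinear_upsample(x, factor):
    """Bilinear upsampling of a (C, H, W) array, align-corners style."""
    _, h, w = x.shape
    rows = np.linspace(0, h - 1, h * factor)
    cols = np.linspace(0, w - 1, w * factor)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = (rows - r0)[None, :, None]   # vertical interpolation weights
    fc = (cols - c0)[None, None, :]   # horizontal interpolation weights
    top = x[:, r0][:, :, c0] * (1 - fc) + x[:, r0][:, :, c1] * fc
    bottom = x[:, r1][:, :, c0] * (1 - fc) + x[:, r1][:, :, c1] * fc
    return top * (1 - fr) + bottom * fr

def fuse_decoder_inputs(high_level, low_level, reduce_weight, factor=4):
    reduced = conv1x1(low_level, reduce_weight)            # low-level FM -> 48 channels
    upsampled = bilinear_upsample(high_level, factor)      # match low-level resolution
    return np.concatenate([upsampled, reduced], axis=0)    # low-level goes after high-level

rng = np.random.default_rng(0)
high = rng.standard_normal((32, 4, 4))      # hypothetical encoder output
low = rng.standard_normal((64, 16, 16))     # hypothetical low-level FM
reduce_weight = rng.standard_normal((48, 64))
fused = fuse_decoder_inputs(high, low, reduce_weight)
print(fused.shape)   # (80, 16, 16)
```

Keeping the low-level branch at only 48 channels stops the detail features from outweighing the much richer high-level features after concatenation.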

With Attention

Dual Attention Network

  • Dot-product attention on the feature map, applied both channel-wise and spatially

  • The visualizations of the two attention maps are quite interesting.


BiSeNet

A fast segmentation structure built on Xception 39, a very shallow spatial branch sub-net and channel-wise attention.

  • Main idea contains two parts: a thin and deep part for extracting context info, as well as a wide and shallow part for extracting spatial info.

  • ARM / FFM here are both channel-wise attention modules

  • FM + FM * Attention (the skip version in FFM) outperforms the product-only version (in ARM). But because the element-wise addition costs extra computation, the FFM form is only used at the most critical part: feature fusion.
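The two ways of applying a channel attention vector mentioned above differ by one addition. A minimal sketch with hypothetical function names, taking the attention vector as given:

```python
import numpy as np

def apply_attention_product_only(fm, attention):
    """ARM-style use of the weights: FM * Attention."""
    return fm * attention[:, None, None]

def apply_attention_with_skip(fm, attention):
    """FFM-style use of the weights: FM + FM * Attention."""
    return fm + fm * attention[:, None, None]

rng = np.random.default_rng(0)
fm = rng.standard_normal((8, 4, 4))
attention = rng.uniform(0.0, 1.0, size=8)   # stand-in for learned channel weights
arm_out = apply_attention_product_only(fm, attention)
ffm_out = apply_attention_with_skip(fm, attention)
print(np.allclose(ffm_out - arm_out, fm))   # True
```

With the skip, a channel whose weight is close to 0 is passed through unchanged instead of being erased.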

With Affinity Field


Spatial Propagation Network

  • Use the idea of propagation to accumulate the affinity field

  • The learning of a large affinity matrix for diffusion is converted to learning a local linear spatial propagation, yielding a simple yet effective approach for output enhancement.

Adaptive Affinity Fields

  • Treat affinity field optimization as a separate loss for segmentation

CSPN for Depth Estimation

  • Replace three-way propagation with convolution

  • Use CSPN to aggregate sparse LiDAR depth information and refine the final results (depth estimation).

Counter Specific Problems

Deconv -- Checkerboard Pattern

Checkerboard Pattern

To enlarge the feature map, we often have to use Deconv, which lets the model use every point in the small input feature map to "paint" a square in the larger output. However, if the Deconv kernel size is not divisible by its stride, the painted squares overlap unevenly, which easily produces the well-known checkerboard pattern in the output.
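The uneven overlap is easy to reproduce in 1-D by counting how many input points paint each output position. The helper below is a hypothetical sketch, not taken from any library:

```python
import numpy as np

def overlap_counts(input_len, kernel_size, stride):
    """Number of input points that contribute to each output position
    of a 1-D transposed convolution without padding."""
    out = np.zeros((input_len - 1) * stride + kernel_size, dtype=int)
    for i in range(input_len):
        out[i * stride:i * stride + kernel_size] += 1
    return out

uneven = overlap_counts(4, kernel_size=3, stride=2)
even = overlap_counts(4, kernel_size=4, stride=2)
print(uneven)   # [1 1 2 1 2 1 2 1 1]
print(even)     # [1 1 2 2 2 2 2 2 1 1]
```

With kernel size 3 and stride 2 the counts alternate between 1 and 2, while kernel size 4 gives a constant count away from the borders. In 2-D the alternation happens along both axes, which produces the checkerboard.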

Kernel Size 3, Stride 2
Kernel Size 4, Stride 2
Checkerboard Pattern

In theory, our models could learn weights that compensate for the unevenly overlapped positions and balance the output. But in practice, neural networks struggle to achieve that.

Optimal Kernel

In fact, the real situation is far from the optimal and even worse than we thought. Not only do models with uneven overlap not learn to avoid this, but models with even overlap often learn kernels that cause similar artifacts.

Even overlap, but still a checkerboard pattern

An alternative way to eliminate the checkerboard pattern is to first interpolate the small input feature map to the target size and then apply a regular Convolution.
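A 1-D sketch of this resize-then-convolve idea is shown below. Nearest-neighbour interpolation is used here because it is the shortest to write, though bilinear interpolation would work the same way. The averaging kernel stands in for learned weights.

```python
import numpy as np

def nearest_upsample_1d(x, factor):
    """Nearest-neighbour interpolation: repeat every sample `factor` times."""
    return np.repeat(x, factor)

def resize_conv_1d(x, kernel, factor):
    """Upsample first, then apply an ordinary stride-1 convolution."""
    upsampled = nearest_upsample_1d(x, factor)
    return np.convolve(upsampled, kernel, mode="same")

kernel = np.ones(3) / 3.0                 # stand-in for a learned kernel
out = resize_conv_1d(np.ones(4), kernel, factor=2)
print(np.round(out, 3))
```

For a constant input, every interior output value is identical, because each output position is covered by the same number of kernel taps. Only the two border values are smaller, due to zero padding.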

Dilated Convolution -- Grid Pattern

Grid Pattern
  • When blocks with dilation rate = 2 are stacked, only a sparse grid of input pixels contributes to each output pixel.

  • The paper claims that using HDC (hybrid dilated convolution) could eliminate the grid pattern.
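The grid pattern can be reproduced in 1-D by listing which input offsets can reach one output position after a stack of dilated convolutions. The helper name and the mixed-rate schedule (1, 2, 5) are illustrative assumptions, not a statement of the exact configuration used in the paper.

```python
def contributing_offsets(rates, kernel_size=3):
    """Input offsets (relative to the output position) that influence the
    output after stacking 1-D dilated convolutions with the given rates."""
    half = kernel_size // 2
    offsets = {0}
    for rate in rates:
        offsets = {o + tap * rate
                   for o in offsets
                   for tap in range(-half, half + 1)}
    return sorted(offsets)

same_rate = contributing_offsets([2, 2, 2])
mixed_rate = contributing_offsets([1, 2, 5])
print(same_rate)    # [-6, -4, -2, 0, 2, 4, 6]
print(mixed_rate)   # every integer from -8 to 8
```

With the same rate in every layer, the odd offsets never contribute, which is the sparse grid. Mixing rates without a common factor fills in the holes.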