A Peek at Semantic Segmentation
Figure: an interactive optimization version of Graph Cut.
In brief, we can see segmentation as a pixel-wise classification task. Since predictions are made at the pixel level, a large and representative final feature map is needed.
A good final feature map should combine three elements: large spatial size, plenty of high-level abstract information (go deep), and rich low-level details (skip connections from small-stride layers).
There are two ways to meet these requirements:
Reuse and modify pre-trained SOTA classification networks.
Design a segmentation-specific backbone and train it on ImageNet from scratch. (Check the link for further analysis on this topic.)
When reusing existing classification models, we need to enlarge the backbone's late-stage feature maps, since most networks end with 1/32 or 1/64 resolution feature maps, which are far too small for segmentation.
There are two ways to enlarge them:
Transposed convolution / bilinear interpolation to upsample the small feature map.
Dilated convolution to keep the size of the early-stage large feature maps and never downsample after the target stride, e.g. keeping stride 8 in DRN / PSPNet / DeepLab V3 or stride 16 in DeepLab V3+.
With carefully chosen initial weights, Deconv can work as bilinear interpolation. Below is a sketch of such a bilinear filler for Deconv (MXNet ships this as its Bilinear initializer).
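A minimal NumPy sketch of the standard FCN-style bilinear initialization (the function name is illustrative):

```python
import numpy as np

def bilinear_filler(in_channels, out_channels, kernel_size):
    """Weights that make a transposed convolution perform bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    # identity mapping across channels: channel i upsamples only itself,
    # so this assumes in_channels == out_channels
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight
```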
Stop downsampling after the last target-stride layer (St, e.g. St = 8) in the middle of the backbone network, and use dilation rates 2, 4, ... on the layers that originally ran at stride St * 2 (stride 16), St * 4 (stride 32), ..., so the pre-trained backbone parameters keep their original receptive fields.
For a layer that is changed to stride 16 but had stride 32 in the original backbone, each filter must cover a 5 x 5 area on the now-denser feature map to keep its receptive field unchanged. A 3 x 3 filter with dilation rate 2 covers exactly such a 5 x 5 area, as in the sketch below.
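A minimal PyTorch sketch of this replacement (channel sizes are illustrative; real implementations such as DRN also adjust all downstream layers):

```python
import torch.nn as nn

# original late-stage layer: downsamples stride 16 -> stride 32
conv = nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1)

# dilated replacement: stays at stride 16 with the same receptive field,
# since a 3x3 filter with dilation 2 covers a 5x5 area on the denser map
conv_dilated = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=2, dilation=2)
```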
An affinity matrix is a generic matrix that determines how close, or similar, two points are in a space. In computer vision tasks, it corresponds to a weighted graph that regards each pixel as a node and connects each pair of pixels by an edge; the weight on that edge should reflect the pairwise similarity with respect to the task at hand.
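As a concrete example, a minimal NumPy sketch of a Gaussian affinity matrix over per-pixel features (the feature choice and sigma are assumptions for illustration; the dense N x N matrix is only practical for small N):

```python
import numpy as np

def gaussian_affinity(feats, sigma=1.0):
    """feats: (N, D) array of per-pixel features (e.g. color + position).
    Returns the (N, N) matrix A[i, j] = exp(-||f_i - f_j||^2 / (2 * sigma^2))."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```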
Some papers propose to estimate / learn the affinity field and use it to improve boundary performance.
Some recent segmentation papers use attention modules to improve performance. There are two attention module design choices: channel and spatial attention. For more discussion of efficiency, please check here.
The Squeeze-and-Excitation method uses fully connected layers to estimate / calculate the channel-wise attention weights, as in the sketch below.
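A minimal PyTorch sketch of an SE block (the reduction ratio of 16 is the commonly used default, an assumption here):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pool -> FC bottleneck -> sigmoid weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=(2, 3))             # squeeze: (N, C) channel statistics
        w = self.fc(w)                     # excitation: per-channel weights in [0, 1]
        return x * w[:, :, None, None]     # reweight each channel
```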
The Non-local method exploits self-attention (similar to a covariance matrix) to enhance and suppress targeted spatial information, as sketched below.
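A simplified PyTorch sketch of a non-local block (embedded-Gaussian form with a residual connection; BN and subsampling tricks omitted):

```python
import torch.nn as nn
import torch.nn.functional as F

class NonLocal2d(nn.Module):
    """Self-attention over spatial positions: y = softmax(theta(x) @ phi(x)^T) @ g(x)."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (n, hw, inter)
        k = self.phi(x).flatten(2)                     # (n, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (n, hw, inter)
        attn = F.softmax(q @ k, dim=-1)                # (n, hw, hw) pairwise similarity
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                         # residual connection
```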
FCN upsamples only once, i.e. it has only one upsampling layer in the decoder.
The original implementation (see its GitHub repo) uses bilinear interpolation for upsampling the convolved image, i.e. there is no learnable filter here.
Variants of FCN (FCN-16s and FCN-8s) add skip connections from lower layers to make the output robust to scale changes.
Multiple upsampling layers
Uses skip connections, and concatenates instead of adding up
Uses learnable weight filters instead of a fixed interpolation technique (see the sketch after this list)
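A minimal PyTorch sketch of such a decoder block (learnable upsampling plus a concatenated skip; channel arguments are illustrative):

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Decoder step: learnable upsampling, then concatenate the encoder skip feature."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # learnable filter
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # upsample 2x
        x = torch.cat([x, skip], dim=1)     # concatenate instead of adding up
        return self.conv(x)
```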
As described above: stop downsampling after the last target-stride layer in the middle of the backbone, and use dilation rates 2, 4, ... on the following layers so the pre-trained backbone parameters keep their original receptive fields.
DRN as backbone
Auxiliary loss between the end of the backbone and the SPP module
Synchronized BN for training with small batch sizes
Dropout before classifier
Hard to reimplement; rumor has it the ResNet-101 backbone was pre-trained on a SenseTime-internal classification dataset.
Dilated convolution in VGG16
CRF as post-processing
DRN backbone
ASPP for multi-scale context fusion
Dense CRF as post-processing
Stride 8 backbone output
Revisited ASPP (a sketch appears after this list)
Abandon the branch with dilation rate 24, since with a very large dilation rate most of the 3 x 3 kernel's weights fall outside the valid feature map region and the kernel degrades toward a 1 x 1 filter.
Add a global average pooling branch with bilinear interpolation upsampling, just like the 1 x 1 SPP branch in PSPNet.
Train on the stride 16 final output first, then freeze all BN layers and continue training on the stride 8 final output; an alternative to synchronized BN.
After training on the augmented VOC set, fine-tune on the official VOC 2012 trainval set only.
Double the images of hard classes within one epoch.
No CRF
Using a backbone pre-trained on a Google-internal classification dataset boosts the segmentation performance.
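A minimal PyTorch sketch of the revisited ASPP referenced above (BN / ReLU omitted for brevity; the rates and 256 channels follow the stride 16 defaults and are assumptions here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """1x1 branch + three dilated 3x3 branches + global-average-pooling branch."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(        # global average pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        feats.append(pooled)                    # upsampled back to the FM size
        return self.project(torch.cat(feats, dim=1))
```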
An efficient combination of the SPP series and the encoder-decoder design
ASPP is the same
Dilated Xception backbone with stride 16 final backbone output. Xception itself is a high-performance and efficient backbone, and the stride 16 version costs roughly 4x less computation than the stride 8 version, so the backbone (encoder) part is quite fast compared to its predecessors.
Uses bilinear interpolation as the internal upsampling method
Low-level feature map channels are reduced to a small number (48) and then concatenated with the upsampled high-level feature map (see the decoder sketch after this list)
Training tricks are the same as in DeepLab V3
SOTA performance
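A minimal PyTorch sketch of the V3+ decoder step described above (module and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, low_ch, aspp_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)   # shrink low-level channels to 48
        self.fuse = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, aspp_out, low_level):
        low = self.reduce(low_level)
        # bilinear upsampling of the high-level FM to the low-level FM size
        x = F.interpolate(aspp_out, size=low.shape[2:],
                          mode='bilinear', align_corners=False)
        x = self.fuse(torch.cat([x, low], dim=1))   # concat low-level after upsampling
        return self.classifier(x)                    # final bilinear upsampling follows
```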
Feature map dot-product attention, both channel-wise and spatial-wise
The visualizations of the two attention maps are quite interesting. Check here for further pictures.
A fast segmentation structure built on Xception-39, a very shallow spatial-branch sub-net, and channel-wise attention.
The main idea has two parts: a thin, deep part for extracting context information, and a wide, shallow part for extracting spatial information.
ARM / FFM here are both channel-wise attention modules
FM + FM * attention (the skip version, in FFM) outperforms the dot-product-only version (in ARM). But because the element-wise addition costs extra computation, the FFM form is only used at the most critical part, feature fusion (see the sketch below).
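A simplified PyTorch sketch of the two attention forms (BN / ReLU omitted; channel arguments are illustrative):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global pool -> 1x1 conv -> sigmoid: per-channel weights in [0, 1]."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.fc(x)

class ARM(nn.Module):
    """Dot-product-only refinement: FM * attention."""
    def __init__(self, channels):
        super().__init__()
        self.attn = ChannelAttention(channels)

    def forward(self, x):
        return x * self.attn(x)

class FFM(nn.Module):
    """Skip version on the fused features: FM + FM * attention."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.attn = ChannelAttention(out_ch)

    def forward(self, spatial, context):
        x = self.conv(torch.cat([spatial, context], dim=1))  # fuse the two branches
        return x + x * self.attn(x)                          # FM + FM * attention
```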
Use the propagation idea to accumulate the affinity field
The learning of a large affinity matrix for diffusion is converted to learning a local linear spatial propagation, yielding a simple yet effective approach to output enhancement.
Treat affinity field optimization as a separate loss for segmentation (a single-direction sketch of linear propagation follows)
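A minimal single-direction NumPy sketch of linear spatial propagation (the actual SPN uses three-way connections and four scan directions and learns the gates; the scalar gates here are purely illustrative):

```python
import numpy as np

def propagate_left_to_right(x, gate):
    """x, gate: (H, W) arrays; gate values in [0, 1] control how much to diffuse.
    h[:, 0] = x[:, 0];  h[:, j] = (1 - g[:, j]) * x[:, j] + g[:, j] * h[:, j-1]"""
    h = x.copy()
    for j in range(1, x.shape[1]):
        h[:, j] = (1 - gate[:, j]) * x[:, j] + gate[:, j] * h[:, j - 1]
    return h
```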
CSPN for Depth Estimation
Replace the three-way propagation with convolution
Use CSPN to aggregate sparse LiDAR depth information and refine the final results (depth estimation).
In order to enlarge the feature map, we often have to use Deconv, which allows the model to use every point in the small input feature map to "paint" a square in the larger output one. However, if the Deconv kernel size is not divisible by its stride, it can easily produce "uneven overlap" and the well-known checkerboard pattern in the output.
In theory, our models could learn an optimal weight distribution that compensates for the uneven positions and balances the output. In practice, neural networks struggle to achieve that.
In fact, the real situation is far from optimal and even worse than we might think: not only do models with uneven overlap fail to learn to avoid this, but models with even overlap often learn kernels that cause similar artifacts.
An alternative way to eliminate the checkerboard pattern is to interpolate the small input feature map first and then apply a regular convolution, as sketched below.
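A minimal PyTorch sketch of this resize-convolution alternative (the function name is illustrative):

```python
import torch.nn as nn

def resize_conv(in_ch, out_ch, scale=2):
    """Upsample with fixed bilinear interpolation, then apply a regular learnable
    conv: every output pixel sees an evenly overlapping input region, which
    avoids the checkerboard artifacts of Deconv."""
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    )
```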
If we stack blocks that all use dilation rate 2, only a sparse grid of input pixels contributes to each output pixel (the gridding effect).
The paper claims that using HDC (hybrid dilated convolution), i.e. cycling through different dilation rates instead of repeating the same one, can eliminate the grid pattern; see the sketch below.
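A minimal PyTorch sketch of an HDC-style stack (BN / ReLU omitted; the rate pattern 1, 2, 5 follows the paper's example and is an assumption here):

```python
import torch.nn as nn

def hdc_block(channels, rates=(1, 2, 5)):
    """Stack 3x3 dilated convs with cycling rates so the union of their
    receptive fields covers the input densely, instead of the sparse grid
    produced by repeating the same rate everywhere."""
    return nn.Sequential(*[
        nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
        for r in rates
    ])
```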