Design Choices for Mobile-Friendly Deep Learning Models: Semantic Segmentation

Define Task and Preprocess Data

  • Define a general task and collect as much labeled ground truth (GT) as possible.

  • Find a specific task, which might be a part of the general task. It might have only 1/10 as much GT, but collect tons of scenario-specific images, even without GT.

  • Train a SOTA model on the general GT. It may be very computationally redundant, which is fine at this stage.

  • Use the SOTA model to do inference on the collected scenario-specific images.

  • Post-process the SOTA inference results and manually select the good ones as GT' to train small models.

  • Do hard negative mining at both the image level and the pixel level.
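
The last step above can be sketched in a few lines. This is only one possible reading of image-level and pixel-level mining, assuming a binary mask and per-pixel foreground probabilities stored as flat lists. The function names, the `keep_ratio` parameter, and the "rank by loss" criterion are illustrative assumptions, not something the notes above specify.

```python
import math

def pixel_losses(probs, labels, eps=1e-7):
    """Per-pixel binary cross-entropy; probs are foreground probabilities."""
    losses = []
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-math.log(p) if y == 1 else -math.log(1.0 - p))
    return losses

def hard_pixel_mask(losses, keep_ratio=0.25):
    """Pixel level: mark the highest-loss pixels (ties may keep a few more)."""
    k = max(1, int(len(losses) * keep_ratio))
    threshold = sorted(losses, reverse=True)[k - 1]
    return [loss >= threshold for loss in losses]

def hardest_images(per_image_losses, top_n):
    """Image level: rank images by mean pixel loss, hardest first."""
    means = {name: sum(l) / len(l) for name, l in per_image_losses.items()}
    return sorted(means, key=means.get, reverse=True)[:top_n]

if __name__ == "__main__":
    losses = pixel_losses([0.9, 0.2, 0.6, 0.1], [1, 1, 0, 0])
    print(hard_pixel_mask(losses, keep_ratio=0.5))
    print(hardest_images({"a": [0.1, 0.1], "b": [1.0, 2.0], "c": [0.5, 0.5]}, 2))
```

In a training loop, the pixel mask would typically gate which pixels contribute to the loss, and the image ranking would decide which images get resampled more often.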

Fast and Accurate Backbone

Here I use semantic segmentation as the example, but the backbone can be treated as a generic feature encoder, so most of the techniques here can also be used in other tasks. DetNet, for example, uses dilated convolutions to build a better detection backbone.

  • Large kernels / Fast downsampling at the beginning.

  • Keep at least 1/16 resolution for most of the remaining layers.

  • At the end of the backbone, go deeper at lower resolution to capture better global information. A good take-home is 1/64 resolution, with 5 to 10 more layers.

  • Use Hybrid Dilated Convolution (HDC; dilation rates 1, 2, 5 in sequence) to eliminate the gridding pattern.

  • For the same FLOPs, keep the memory access cost (MAC) as low as possible. Take-home: use the same number of input and output channels.

  • Convolving only half of the channels and then shuffling the channels is good enough.

  • Training the backbone on a classification dataset larger than ImageNet boosts performance with no increase in computational complexity.
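
Two of the rules of thumb above can be checked with plain arithmetic. The sketch below assumes the simplified cost model for a 1x1 convolution used in the ShuffleNet V2 guidelines (FLOPs = h·w·c_in·c_out, MAC = h·w·(c_in + c_out) + c_in·c_out), and it models stacked dilated convolutions in 1-D with 3-tap kernels. The function names and the example sizes are illustrative assumptions.

```python
def pointwise_conv_cost(h, w, c_in, c_out):
    """Return (FLOPs, MAC) of a 1x1 convolution on an h x w feature map."""
    flops = h * w * c_in * c_out
    # Reads of the input, writes of the output, plus the weights themselves.
    mac = h * w * (c_in + c_out) + c_in * c_out
    return flops, mac

def covered_offsets(dilation_rates):
    """1-D input offsets that reach one output through stacked 3-tap convs."""
    offsets = {0}
    for rate in dilation_rates:
        offsets = {o + tap * rate for o in offsets for tap in (-1, 0, 1)}
    return offsets

def has_grid_gaps(dilation_rates):
    """True if some position inside the receptive field is never sampled."""
    offsets = covered_offsets(dilation_rates)
    return any(o not in offsets for o in range(min(offsets), max(offsets) + 1))

if __name__ == "__main__":
    # Same FLOPs (c_in * c_out is constant), different channel balance.
    for c_in, c_out in [(128, 128), (64, 256), (32, 512)]:
        print(c_in, c_out, pointwise_conv_cost(56, 56, c_in, c_out))
    # Rates 1, 2, 5 cover every position; rates 2, 2, 2 leave a grid of holes.
    print(has_grid_gaps([1, 2, 5]), has_grid_gaps([2, 2, 2]))
```

Under this model the three channel configurations have identical FLOPs, and the balanced one has the lowest MAC. The dilation check shows why mixing rates helps: with rates 2, 2, 2 only even offsets are ever sampled.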

Segmentation Structure

  • SPP (spatial pyramid pooling) is faster than ASPP (atrous SPP), especially when used directly after a DRN (dilated residual network).

  • Since SPP only needs pooled feature maps, use the deepest, lowest-resolution part of the backbone (introduced above) as the SPP input, and the deepest stride-16 part as the base feature map to be refined.

  • Using low-level features always boosts performance, but the lower the level we use, the higher the computational cost. Take-home: 1/4 resolution is enough for most real-time cases.

  • Both DeepLab V3+ and UNet use low-level features, but the former works far better in general cases. Several factors contribute:

    • better final backbone output for V3+;

    • multi-scale information fusion from ASPP;

    • bilinear interpolation performs similarly to deconvolution in most cases but runs more efficiently;

    • a better training strategy: use a large batch size to accumulate BN means and variances, then freeze BN and train the convolutions.

  • Bilinear interpolation works well on small objects, but not as well as deconvolution on large objects and binary-class tasks.

  • Also use depthwise + pointwise convolutions, or shuffle units, in the decoder.

  • Squeeze-and-excitation (SE) channel attention can be used after SPP. Skip + sigmoid works better than sigmoid only.

  • SE channel attention doesn't cost many FLOPs, but it does need some extra parameters. More efficient attention methods, such as CBAM, could also be used.

  • Use cross-entropy + L2 regression + dice loss for binary tasks.

  • Post-processing:

    • Keeping the results steady is important to the human visual system.

    • Try as many traditional image processing methods as possible. Easy tricks can make all the difference.

  • Transfer to other tasks:

    • Up to the final segmentation classifier, the whole structure described above is a feature encoder. Treating it as an FPN and doing detection on top should also work.
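
The combined loss suggested above for binary tasks (cross-entropy + L2 regression + dice) can be sketched as follows. This is a minimal illustration on flat lists of foreground probabilities and 0/1 labels. The equal default weights, the `eps` smoothing term, and all function names are assumptions, since the notes do not say how the three terms are balanced.

```python
import math

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def l2_loss(probs, labels):
    """Mean squared error between probabilities and labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def dice_loss(probs, labels, eps=1e-7):
    """Soft dice loss: 1 - 2|P ∩ G| / (|P| + |G|), smoothed by eps."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def combined_loss(probs, labels, w_ce=1.0, w_l2=1.0, w_dice=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_ce * bce_loss(probs, labels)
            + w_l2 * l2_loss(probs, labels)
            + w_dice * dice_loss(probs, labels))

if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    print(combined_loss([0.9, 0.1, 0.8, 0.2], labels))  # confident, mostly right
    print(combined_loss([0.5, 0.5, 0.5, 0.5], labels))  # uninformative
```

In a real framework the same three terms would be computed on tensors, with the weights tuned per task.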