Further Attention Utilization -- Efficiency & Segmentation


The Non-local method exploits self-attention (a covariance-like similarity matrix) to enhance or suppress targeted spatial information.

The Squeeze-and-Excitation method uses fully connected layers to estimate the channel-wise attention weights.

CBAM: Convolutional Block Attention Module

A simple yet effective attention module for convolutional neural networks: given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are then multiplied with the input feature map for adaptive feature refinement.

Channel attention and spatial attention are processed sequentially, and both are computed with convolution layers (1 x 1 for channel attention, 7 x 7 for spatial attention), not with covariance similarity or self-attention.

In the CBAM paper, the authors ran detailed ablation experiments:
  • Pooling in the channel attention branch: average + max > average > max.
  • First stage of the spatial attention branch: channel pooling > 1 x 1 conv channel compression.
  • Second stage of the spatial attention branch: 7 x 7 conv > 3 x 3 conv.
  • How to integrate the two attentions: channel then spatial (sequential) > spatial then channel > channel & spatial in parallel.

Channel Attention Computation Complexity

The FLOPs of the Squeeze-and-Excitation channel attention structure are quite small; the only limitation is the parameter size of the 1 x 1 conv, Cin x 1 x 1 x Cout, which cannot be neglected, especially in the late stages of deep neural networks. For channel / spatial separated structures like MobileNet, this could double the size of the whole model. CBAM therefore shares the FC layer between the max-pooling branch and the average-pooling branch, so the attention module parameters are not doubled, and this works well in practice.
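For reference, a minimal PyTorch-style sketch of such a channel attention module (not the authors' code; the reduction ratio of 16 is just the common SE/CBAM default):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention in the CBAM spirit: average- and max-pooled channel
    descriptors go through a *shared* two-layer MLP (1x1 convs), so the two
    pooling branches do not double the parameter count."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared by both pooling branches
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):                                          # x: (N, C, H, W)
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))    # (N, C, 1, 1)
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))     # (N, C, 1, 1)
        return torch.sigmoid(avg + mx)                             # channel weights in (0, 1)
```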

Spatial Attention Computation Complexity

Self-attention over the feature map's spatial covariance matrix, as in the non-local neural network, has to compute the [NHW x C] dot [C x NHW] product. Its FLOPs are NHW x Cin x NHW; compared with a regular convolution's NHW x K x K x Cin x Cout, the non-local attention module can cost a lot of computation. Even though the non-local network already applies a channel bottleneck at the beginning to halve the input channels (e.g. 1024 -> 1 x 1 conv -> 512), the FLOPs remain quite large. Unlike non-local spatial attention, which computes self-attention (covariance) over all channels (e.g. 1024 channels), CBAM uses only 2 channels (channel-wise average pooling and max pooling) and a 7 x 7 convolution to estimate the spatial attention, costing NHW x 7 x 7 x 2 x 1 FLOPs. Whenever NHW x Cin > 7 x 7 x 2 x 1, CBAM needs less computation than non-local spatial attention.
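A similar sketch for the spatial branch, together with a back-of-the-envelope FLOPs comparison for an assumed 32 x 32 x 512 feature map (the numbers are only illustrative):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention in the CBAM spirit: channel-wise average and max
    pooling give a 2-channel map, and a single 7x7 conv turns it into a
    per-pixel weight -- no NHW x NHW covariance matrix is ever formed."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                             # x: (N, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)      # (N, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)     # (N, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (N, 1, H, W)

# Rough multiply-accumulate counts for one 32 x 32 x 512 feature map (N = 1):
H, W, C = 32, 32, 512
non_local = (H * W) * C * (H * W)    # [HW x C] @ [C x HW] covariance: ~537M
cbam = (H * W) * 7 * 7 * 2 * 1       # 7x7 conv over the 2-channel map: ~100K
print(non_local, cbam)
```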

Considering CBAM's efficiency in both channel attention and spatial attention, it can be migrated to light-weight networks such as MobileNet, where CBAM also shows good results.
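Putting the two together in the order the ablation favors (channel first, then spatial), and reusing the two sketches above:

```python
import torch.nn as nn

class CBAMBlock(nn.Module):
    """Sequential arrangement preferred by the ablation: refine channels first,
    then spatial locations. Relies on the ChannelAttention / SpatialAttention
    sketches defined above."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        x = x * self.channel_att(x)   # (N, C, 1, 1) broadcasts over H, W
        x = x * self.spatial_att(x)   # (N, 1, H, W) broadcasts over C
        return x
```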

Sparse Attention

Strided Sparse Attention: a locally dense but long-range sparse sampling strategy for attention. To achieve this, the two sampling patterns (local and strided) are applied alternately.

Because each position can only attend to the previous sequence, the sampling is restricted to the lower triangular part of the attention matrix:

In this way, Sparse Attention can cover a longer sequence under the same computation budget. OpenAI used this method to extend the Transformer's sequence length to about 30 times that of the original approach.
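A toy sketch of building such a strided, causal sparse mask (the window size 4 and stride 4 are arbitrary choices for illustration; the real Sparse Transformer uses fused kernels rather than a dense boolean mask):

```python
import torch

def strided_sparse_mask(seq_len, local_window=4, stride=4):
    """Causal attention mask combining a locally dense window with long-range
    strided positions. True = query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)    # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)    # key positions (row)
    causal = j <= i                           # lower triangular: only the past
    local = (i - j) < local_window            # dense neighbourhood of each query
    strided = (j % stride) == (stride - 1)    # every stride-th position, long range
    return causal & (local | strided)

mask = strided_sparse_mask(16)
print(mask.int())
print("dense causal entries :", (torch.tril(torch.ones(16, 16)) > 0).sum().item())
print("sparse entries       :", mask.sum().item())
```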

Dual Attention Network for Scene Segmentation

DANet is a segmentation network also built on two attention paths, position (spatial) attention and channel attention, but arranged in parallel.

DANet does not use FC layers or convolutions to compute the attention maps; for both position and channel attention it always uses self-attention (covariance). The number of weights is definitely smaller, but fewer FLOPs cannot be guaranteed. Compared with its FCN baseline, DANet does help boost performance.
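A rough PyTorch-style sketch of the position-attention idea (the channel branch is analogous, using a C x C affinity matrix instead of HW x HW); it follows the general DANet recipe but is not the official implementation:

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position (spatial) attention in the DANet style: the attention map is an
    (HW x HW) similarity/covariance matrix between pixels, computed from the
    features themselves rather than learned by an FC layer or a large conv."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))     # learned residual scale

    def forward(self, x):                             # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (N, HW, C/r)
        k = self.key(x).flatten(2)                    # (N, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)           # (N, HW, HW) pixel affinities
        v = self.value(x).flatten(2)                  # (N, C, HW)
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)
        return self.gamma * out + x                   # few weights, but HW x HW FLOPs
```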

The overall idea of DANet is straightforward; more interestingly, the results of the spatial attention experiments are quite illustrative.

[Figures: attention sampling patterns -- dilated sampling, local sampling, and locally dense, long-range sparse sampling; left: original attention sampling, right: Sparse Attention over the previous sequence.]
[Figures: DANet spatial attention map for 2 points; channel attention map for two selected channels.]
Reference: Generating Long Sequences with Sparse Transformers (arXiv.org).