Further Attention Utilization -- Efficiency & Segmentation

The non-local method exploits self-attention (a covariance-like similarity matrix) to enhance or suppress targeted spatial information.
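
For quick reference, here is a minimal PyTorch-style sketch of such a non-local block; the theta/phi/g naming and the half-channel bottleneck are assumptions following the commonly cited simplified form, not a specific reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                       # channel bottleneck, e.g. 1024 -> 512
        self.theta = nn.Conv2d(channels, inter, 1)  # query projection
        self.phi = nn.Conv2d(channels, inter, 1)    # key projection
        self.g = nn.Conv2d(channels, inter, 1)      # value projection
        self.out = nn.Conv2d(inter, channels, 1)    # restore channel count

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # [N, HW, C']
        k = self.phi(x).flatten(2)                     # [N, C', HW]
        v = self.g(x).flatten(2).transpose(1, 2)       # [N, HW, C']
        attn = F.softmax(q @ k, dim=-1)                # [N, HW, HW] spatial similarity
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                         # residual connection
```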

The Squeeze-and-Excitation method uses fully connected layers to estimate the channel-wise attention weights.
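
A minimal sketch of the Squeeze-and-Excitation idea, assuming a PyTorch NCHW layout and the commonly used reduction ratio r = 16:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # excite back to C weights
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pooling -> [N, C]
        w = self.fc(w).view(n, c, 1, 1)  # channel-wise attention weights
        return x * w                     # rescale each channel
```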

CBAM: Convolutional Block Attention Module

A simple yet effective attention module for convolutional neural networks. Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial; the attention maps are then multiplied with the input feature map for adaptive feature refinement.

Channel attention and spatial attention are processed sequentially, and both are computed via convolution layers (a shared 1 x 1 conv MLP for channel attention, a 7 x 7 conv for spatial attention), not by covariance similarity or self-attention.
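
A minimal sketch of this sequential design (shared 1 x 1 conv MLP for channel attention, 7 x 7 conv for spatial attention), assuming PyTorch and a reduction ratio of 16; it follows the structure described above rather than the official implementation.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared MLP (implemented as 1x1 convs) used by both pooling branches.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
        )
        # Spatial attention: 7x7 conv over 2 channels (avg + max along C).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention: avg pool + max pool -> shared MLP -> sum -> sigmoid.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: channel-wise avg and max maps -> 7x7 conv -> sigmoid.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```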

In the CBAM paper, the authors ran detailed ablation experiments. For the channel attention branch they compared average pooling, max pooling, and average + max pooling (average + max > average > max). For the first stage of the spatial attention branch they compared channel pooling against 1 x 1 conv channel compression (channel pooling > 1 x 1 conv), and for its second stage they compared a 3 x 3 conv against a 7 x 7 conv (conv 7 x 7 > conv 3 x 3). They also explored the best way to integrate the two attentions: channel then spatial, spatial then channel, or channel and spatial in parallel (channel + spatial > spatial + channel > parallel).

Channel Attention Computation Complexity

The FLOPs of the Squeeze-and-Excitation channel attention structure are quite small; the main limitation is the parameter count, since the 1 x 1 conv (FC) weights have size Cin x 1 x 1 x Cout, which cannot be neglected, especially in the late stages of deep neural networks. For channel/spatial separated structures such as MobileNet, this overhead could even double the size of the whole model. CBAM therefore shares the FC layers between the max-pooling and average-pooling branches, so the attention module's parameters are not doubled, and this works well in practice.
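
A back-of-the-envelope count of the channel-attention MLP parameters at an assumed late stage (C = 1024, reduction ratio 16), showing what sharing the MLP between the two pooling branches saves; the numbers are purely illustrative.

```python
# Two FC / 1x1-conv layers: C -> C/r -> C
C, r = 1024, 16
mlp_params = C * (C // r) + (C // r) * C
print(mlp_params)       # 131072 weights (~0.13 M) with a shared MLP
print(2 * mlp_params)   # ~0.26 M if the avg- and max-pool branches each had their own MLP
```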

Spatial Attention Computation Complexity

Self-attention over the feature map's spatial covariance matrix, as in the non-local neural network, has to compute the dot product [NHW x C] x [C x NHW] (roughly NHW x Cin x NHW FLOPs, compared with NHW x K x K x Cin x Cout for a regular convolution), so the non-local attention module can cost a lot of computation. Even though the non-local network first applies a 1 x 1 channel bottleneck to halve the input channels (e.g. 1024 -> 1 x 1 conv -> 512), the FLOPs are still quite large. Unlike non-local spatial attention, which runs self-attention (covariance) over all channels (e.g. 1024 channels), CBAM only uses 2 channels (channel-wise average pooling and max pooling) and a 7 x 7 convolution to estimate the spatial attention, costing about NHW x 7 x 7 x 2 x 1 FLOPs. Whenever NHW x Cin > 7 x 7 x 2 x 1, CBAM needs fewer computations than non-local spatial attention.
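
A rough FLOP comparison under assumed sizes (32 x 32 spatial resolution, 1024 channels, halved to 512 by the non-local bottleneck); the exact numbers are illustrative only.

```python
H, W, C = 32, 32, 1024
hw = H * W
nonlocal_flops = hw * (C // 2) * hw   # [HW x C'] @ [C' x HW] after the 1x1 bottleneck
cbam_flops = hw * 7 * 7 * 2 * 1       # 7x7 conv over the 2 pooled channel maps
print(nonlocal_flops)                 # ~5.4e8 multiply-adds
print(cbam_flops)                     # ~1.0e5 multiply-adds
```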

Considering CBAM's efficiency in both channel attention and spatial attention, it can be migrated to light-weight networks such as MobileNet, where CBAM also shows good results.

Sparse Attention

Stride Sparse Attention: a locally dense but long-range sparse sampling strategy for attention; the local and strided sampling patterns are applied alternately.

Because only the previous positions in the sequence are accessible, the sampled attention pattern is restricted to the lower triangular part of the attention matrix.

In this way, sparse attention can cover a much longer sequence under the same computation budget. OpenAI used this method to extend the Transformer's sequence length to about 30 times that of the original approach.
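
A small sketch of how such a strided sparse mask can be constructed; the sequence length n = 16 and local window / stride l = 4 are arbitrary illustration values, not values from the paper.

```python
import numpy as np

def strided_sparse_mask(n, l):
    i = np.arange(n)[:, None]        # query positions
    j = np.arange(n)[None, :]        # key positions
    local = (i - j) < l              # locally dense band of the last l tokens
    strided = ((i - j) % l) == 0     # long-range samples every l-th earlier token
    causal = j <= i                  # lower-triangular constraint (previous tokens only)
    return (local | strided) & causal

print(strided_sparse_mask(16, 4).astype(int))
```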

Dual Attention Network for Scene Segmentation

DANet is a segmentation network that is also built on two attention branches, position (spatial) attention and channel attention, but arranged in parallel.

DANet does not use FC or convolution layers to compute the attention maps; for both position and channel attention it always uses self-attention (covariance). The number of weights is clearly smaller, but fewer FLOPs are not guaranteed. Compared with its FCN baseline, DANet does help boost performance.
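
For illustration, here is a simplified sketch of a DANet-style channel attention branch, where the attention map is just the C x C similarity (covariance-like) matrix of the features themselves, so no extra FC or conv weights are needed beyond a learnable residual scale; details of the actual DANet implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        n, c, h, w = x.shape
        f = x.flatten(2)                        # [N, C, HW]
        energy = f @ f.transpose(1, 2)          # [N, C, C] channel-wise similarity
        attn = F.softmax(energy, dim=-1)
        y = (attn @ f).view(n, c, h, w)         # re-weighted channels
        return self.gamma * y + x               # residual connection
```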

The overall idea of DANet is straightforward; more interestingly, the results of the spatial attention experiments are quite illustrative.
