Efficient Convolution Operation
Regular Convolution
Sliding window over the image and do multi-adds on patch level one by one.
Image to Column Convolution
To better utilize CPU cache and the locality of reference, we convert convolution operations into matrix multiplication.
The Locality of Reference (Principal of Locality)
Processor tendency to access the same set of memory locations repetitively
Temporal locality: reuse of specific data/resources within a short time duration
Spatial locality: use of data elements within relatively close storage locations
The locality is a predictable behavior, upon which built systems are great candidates for performance optimization. Techniques like caching, memory prefetching, and processor core branch prediction can contribute to the optimization.
Winograd Convolution
Operations with g0 ~ g2 can be precalculated. Then, the convolution operation is saved from 6 MUL operations to only 4, with 1.5 times faster. For 2-d matrix convolutions, let's say 4x4 matrix with 3x3 convolution kernel, it can be saved from 2 * 2 * 9 = 35 MULs to 4 * 4 = 16 MULs with 2.25 times faster.
Reference
Last updated