Efficient Convolution Operation
The naive approach slides a window over the image and performs multiply-accumulate operations on one patch at a time.
To better utilize the CPU cache and exploit locality of reference, we convert the convolution operation into a matrix multiplication (the im2col transformation).
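A minimal sketch of the im2col idea, assuming a single-channel image, unit stride, and no padding (the function names here are illustrative, not from any particular library): each kernel-sized patch is unrolled into a row of a matrix, so the whole convolution collapses into one matrix product.

```python
import numpy as np

def im2col(img, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one row of a matrix."""
    H, W = img.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = img[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_im2col(img, kernel):
    """2-D valid convolution (cross-correlation) via im2col + one GEMM."""
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    # A single matrix-vector product replaces all per-patch multiply-adds;
    # the patch matrix is laid out contiguously, which is cache-friendly.
    return (im2col(img, kh, kw) @ kernel.ravel()).reshape(out_h, out_w)
```

The memory cost is the usual im2col trade-off: overlapping patches are duplicated in the unrolled matrix, in exchange for feeding a highly optimized GEMM routine.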
Locality of reference is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. It comes in two forms:
Temporal locality: reuse of specific data or resources within a short time window.
Spatial locality: access to data elements stored at relatively close memory locations.
Locality is predictable behavior, and systems built to exploit it are strong candidates for performance optimization. Techniques such as caching, memory prefetching, and branch prediction in processor cores all rely on it.
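Spatial locality can be illustrated by traversal order over a row-major 2-D array: walking along rows touches memory sequentially, while walking down columns jumps a full row's width per step. (In a compiled language the row-major walk is measurably faster due to cache lines; in pure Python the interpreter overhead dominates, so treat this as a sketch of the access pattern, not a benchmark.)

```python
import numpy as np

# Row-major (C-order) array: elements of a row are contiguous in memory.
a = np.arange(200 * 200, dtype=np.float64).reshape(200, 200)

def sum_row_major(x):
    # Inner loop walks contiguous memory: good spatial locality.
    total = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            total += x[i, j]
    return total

def sum_col_major(x):
    # Inner loop strides one full row per step: poor spatial locality.
    total = 0.0
    for j in range(x.shape[1]):
        for i in range(x.shape[0]):
            total += x[i, j]
    return total
```

Both functions compute the same sum; only the memory access pattern differs, which is exactly what cache-aware layouts like im2col optimize for.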
Operations involving only g0 ~ g2 can be precalculated. The convolution then drops from 6 MUL operations to only 4, a 1.5x reduction. For 2-D convolutions, say a 4x4 input tile with a 3x3 kernel (producing a 2x2 output), the count drops from 2 * 2 * 9 = 36 MULs to 4 * 4 = 16 MULs, a 2.25x reduction.
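The 1-D case above is the standard Winograd minimal filtering algorithm F(2,3): two outputs of a 3-tap convolution computed with 4 multiplications instead of the direct 6. A sketch (variable names follow the g0 ~ g2 notation used above; the filter-side terms are the ones that can be precomputed when the kernel is reused):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): compute two 1-D convolution outputs
    y0 = d0*g0 + d1*g1 + d2*g2 and y1 = d1*g0 + d2*g1 + d3*g2
    using 4 multiplications instead of the direct 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms: depend only on g, so they can be
    # precalculated once and reused across all input tiles.
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 multiplications on the data side.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Nesting this transform over both dimensions gives the 2-D F(2x2, 3x3) variant, which is where the 36-to-16 multiplication count in the text comes from.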