A Convolution Layer is a systematic template-matching mechanism with learnable templates. It has a set of filters, each with the same depth as the input volume and a spatial extent (the receptive field) specified by the user. A feature map is generated by sliding one filter across the entire input volume with a given stride. By stacking together the feature maps generated by all filters, we get the output volume.
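To make the sliding concrete, here is a minimal NumPy sketch of the forward pass. It assumes no padding, a square filter, and a (height, width, depth) layout; the function name conv_forward and the example sizes are illustrative choices, not something fixed by the text above.

```python
import numpy as np

def conv_forward(x, filters, stride=1):
    """Naive convolution forward pass (no padding, no bias).

    x:       input volume, shape (H, W, D)
    filters: filter bank, shape (K, F, F, D), i.e. K filters of size F x F x D
    returns: output volume, shape (H_out, W_out, K)
    """
    H, W, D = x.shape
    K, F, _, D_f = filters.shape
    assert D == D_f, "each filter must have the same depth as the input"
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                      # one feature map per filter
        for i in range(H_out):
            for j in range(W_out):
                r, c = i * stride, j * stride
                patch = x[r:r + F, c:c + F, :]          # local patch, full depth
                out[i, j, k] = np.sum(patch * filters[k])  # template match score
    return out

# Assumed example sizes: 7 x 7 x 3 input, four 3 x 3 x 3 filters, stride 2.
x = np.random.default_rng(0).normal(size=(7, 7, 3))
filters = np.random.default_rng(1).normal(size=(4, 3, 3, 3))
out = conv_forward(x, filters, stride=2)
print(out.shape)  # (3, 3, 4)
```

Each slice out[:, :, k] is one feature map, and the last axis is the stack of all of them.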
There are two features in this design that call for attention. The first is local connectivity. When we create a feature map, each value depends only on a small patch of the original image, across its full depth. In other words, each neuron (a filter spanning the full input depth) is connected only to local input values. In comparison, in a Fully Connected Layer, each neuron connects to all input values. Local connectivity not only greatly reduces the number of parameters but also tells the layer where in the input to focus.
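A quick count shows how much local connectivity saves per neuron. The sizes below (a 32 x 32 x 3 image and a 5 x 5 receptive field) are assumed purely for illustration, and biases are ignored.

```python
def weights_per_local_neuron(F, D):
    # A locally connected neuron sees an F x F patch across the full depth D.
    return F * F * D

def weights_per_fc_neuron(H, W, D):
    # A fully connected neuron sees every value of the H x W x D input.
    return H * W * D

H, W, D, F = 32, 32, 3, 5  # assumed example sizes
local_w = weights_per_local_neuron(F, D)
fc_w = weights_per_fc_neuron(H, W, D)
print(local_w, fc_w)  # 75 3072
```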
The second feature is parameter sharing, which means the same filter is used across the entire image to generate a feature map. The reasoning is that many low-level features are repeated across the image, so there is no need to relearn them at every position. This also greatly reduces the number of parameters. In cases where features are expected to differ a lot between positions, we would want different filters at different (x, y) positions; we then give up the parameter sharing scheme and learn each filter separately. A layer constructed like this is called a Locally Connected Layer. A Convolution Layer is a special case of a Locally Connected Layer with shared parameters.
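The effect of sharing can be seen by counting weights for a whole layer with and without it. This is a rough sketch under assumed sizes (32 x 32 x 3 input, ten 5 x 5 filters, stride 1, no padding, biases ignored); the helper names are made up for the example.

```python
def output_size(n, F, stride=1):
    # Spatial output size along one axis, without padding.
    return (n - F) // stride + 1

def conv_layer_weights(F, D, K):
    # Shared: one F x F x D filter per feature map, reused at every position.
    return K * F * F * D

def locally_connected_weights(H, W, F, D, K, stride=1):
    # Unshared: a separate F x F x D filter at every output position.
    positions = output_size(H, F, stride) * output_size(W, F, stride)
    return positions * K * F * F * D

H, W, D, F, K = 32, 32, 3, 5, 10  # assumed example sizes
shared = conv_layer_weights(F, D, K)
unshared = locally_connected_weights(H, W, F, D, K)
print(shared, unshared)  # 750 588000
```

The ratio between the two counts is exactly the number of output positions (28 x 28 = 784 here).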
Convolution Layers are used to extract image features, from low-level features in the early layers to more high-level, abstract ones in the later layers. The cool part is that the network learns useful filters at each layer by itself.
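A 1 x 1 Convolution Layer helps reduce the depth dimension of the input: with fewer filters than the input has channels, the output keeps the same spatial size but has a smaller depth.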
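Since a 1 x 1 filter covers a single spatial position, the layer amounts to the same linear map over the depth axis applied at every pixel. A minimal NumPy sketch, assuming a (height, width, depth) layout, no bias, and illustrative sizes (64 channels reduced to 16):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 64))   # assumed input volume: 8 x 8 spatial, depth 64
w = rng.normal(size=(64, 16))     # sixteen 1 x 1 x 64 filters, stored as columns

# matmul treats x as a stack of (8, 64) matrices, so every pixel's depth
# vector is multiplied by the same (64, 16) weight matrix.
out = x @ w
print(out.shape)  # (8, 8, 16)
```

The spatial size is unchanged; only the depth shrinks from 64 to 16.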
Tricks & Caveats
We often prefer stacking small Conv Layers to a single Conv Layer with a large receptive field. The first reason is that, for the same effective receptive field, a stack of smaller layers needs far fewer parameters. The second reason is that we can put ReLU non-linearities in between the small layers, an option we don’t have with a single giant layer. The extra non-linearity gives the network more expressiveness.
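As a worked example of the parameter argument, compare three stacked 3 x 3 layers with one 7 x 7 layer. The sketch assumes stride 1, the same number of channels C in and out of every layer, and no biases; C = 64 is an arbitrary choice.

```python
def effective_receptive_field(filter_sizes):
    # With stride 1, each F x F layer widens the receptive field by F - 1.
    rf = 1
    for f in filter_sizes:
        rf += f - 1
    return rf

def stack_weights(filter_sizes, C):
    # Each layer maps C channels to C channels: F * F * C weights per filter, C filters.
    return sum(f * f * C * C for f in filter_sizes)

C = 64            # assumed channel count
small = [3, 3, 3] # three stacked 3 x 3 layers
big = [7]         # one 7 x 7 layer

print(effective_receptive_field(small), effective_receptive_field(big))  # 7 7
print(stack_weights(small, C), stack_weights(big, C))  # 110592 200704
```

In general the stack costs 27 C^2 weights against 49 C^2 for the single layer, while also allowing a ReLU after each of the three layers.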