CNN terminologies

This is an attempt to keep all the must-know CNN (and related) vocabulary under one roof. It also makes use of a few visualization demos mentioned along the way.

Terms to discuss:

  1. Convolution operation
  2. Input channels, Output channels, feature maps
  3. Pooling layers
  4. Stride
  5. Padding
  6. Filters & types of filters

The convolution operation (also known as cross-correlation) is a simple mathematical operation between input channels and filters. The result is an output channel, called a feature map. An input channel holds the pixel values of an image: an RGB image has 3 channels, while a grayscale image has just one.
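
As a rough sketch of what that looks like in code (using NumPy, a toy 5×5 grayscale image, and a hypothetical cross_correlate2d helper written just for illustration):

```python
import numpy as np

def cross_correlate2d(channel, kernel):
    """Slide a kernel over a single input channel (no padding, stride 1)
    and return the resulting feature map."""
    kh, kw = kernel.shape
    out_h = channel.shape[0] - kh + 1
    out_w = channel.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the patch with the kernel and sum.
            feature_map[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
    return feature_map

# A toy 5x5 grayscale "image" and a 3x3 averaging kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(cross_correlate2d(image, kernel).shape)  # (3, 3)
```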

A filter, which is simply a matrix, is convolved over the input channel. Common examples are edge filters such as the Sobel and Prewitt filters, along with custom filters whose values are hand-chosen. In a CNN, however, the filter (kernel) values are learned during training; they are learned parameters. Refer to this visualizer to understand the impact of various kernels on an image and to see the convolution operation in action. Generally, the kernels nearer to the input channels in a deep neural network tend to learn simpler features like edges, while the kernels further away from the input channels tend to learn more complex or abstract features (e.g. shapes of objects or parts of the body, such as the eyes and nose in a facial image).
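
For instance, the Sobel kernels are small fixed matrices with hand-chosen values for edge detection; a quick sketch (using SciPy's correlate2d and a random stand-in image) shows them in action:

```python
import numpy as np
from scipy.signal import correlate2d

# Classic hand-chosen edge filters: Sobel kernels for the horizontal
# and vertical gradient (fixed values, not learned).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

image = np.random.rand(28, 28)                       # stand-in grayscale image
edges_x = correlate2d(image, sobel_x, mode='valid')  # responds to vertical edges
edges_y = correlate2d(image, sobel_y, mode='valid')  # responds to horizontal edges
print(edges_x.shape)  # (26, 26)
```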

The dimensions of the filters are defined and their values randomly initialized before training. The values are then learned with the same objective of minimizing the loss, as explained here.
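
As an illustration (assuming PyTorch), a Conv2d layer fixes the kernel dimensions up front, initializes the values randomly, and registers them as learnable parameters that the optimizer updates:

```python
import torch.nn as nn

# Kernel dimensions are fixed before training: 3 input channels (RGB),
# 16 output channels (feature maps), 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# The kernel values start out randomly initialized...
print(conv.weight.shape)          # torch.Size([16, 3, 3, 3])
# ...and are learnable parameters, updated during training to minimize the loss.
print(conv.weight.requires_grad)  # True
```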

If the feature map would otherwise be too large, the filter can be convolved over the input channels more sparsely by striding. Just check the impact of different stride values on the output size for any given input size: the stride simply defines how far the filter moves at each step as it convolves.
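
A quick PyTorch sketch of how the stride alone changes the output size (toy single-channel 32×32 input, 3×3 kernel, no padding):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # one grayscale 32x32 image

for stride in (1, 2, 4):
    conv = nn.Conv2d(1, 1, kernel_size=3, stride=stride)
    print(stride, conv(x).shape)
# 1 torch.Size([1, 1, 30, 30])
# 2 torch.Size([1, 1, 15, 15])
# 4 torch.Size([1, 1, 8, 8])
```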

Pooling layers are often used right after convolution layers. Pooling reduces computation time, since the output layer is downsampled. Feature maps produced by convolution tend to record the precise position of features in the input image, which makes the network overly sensitive to small shifts; pooling makes the representation more robust to such translations. Striding downsamples as well.

The most common pooling layers: max-pooling and average-pooling.
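
A minimal PyTorch sketch of both, applied to a stand-in batch of feature maps:

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 30, 30)  # 16 feature maps of size 30x30

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps the strongest response in each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # keeps the average response in each 2x2 window

print(max_pool(feature_maps).shape)  # torch.Size([1, 16, 15, 15]) -- downsampled by 2
print(avg_pool(feature_maps).shape)  # torch.Size([1, 16, 15, 15])
```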

Padding around the input channel helps you keep the output dimensions consistent, avoiding errors caused by the spatial size shrinking from layer to layer. More importantly, the border information of the image is better accounted for.
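
For example (again a PyTorch sketch), padding a 28×28 input by one pixel on each side keeps a 3×3 convolution from shrinking the output:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)

no_pad   = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # output shrinks to 26x26
same_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # output stays 28x28

print(no_pad(x).shape)    # torch.Size([1, 1, 26, 26])
print(same_pad(x).shape)  # torch.Size([1, 1, 28, 28])
```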

Output size = [(W − K + 2P) / S] + 1, where W is the height or width of the input channel, K is the kernel size, P is the padding and S is the stride. This works when the height and width are the same.

If height and width are different:

Output height = (input height + padding top + padding bottom − kernel height) / (stride height) + 1

Output width = (input width + padding left + padding right − kernel width) / (stride width) + 1
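
A small helper applying the same formula (written just for illustration, assuming the same amount of padding on both sides of each dimension):

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output spatial size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# Square input: 224x224 image, 7x7 kernel, padding 3, stride 2 -> 112x112
print(conv_output_size(224, 7, padding=3, stride=2))  # 112

# Rectangular input: apply the formula per dimension.
print(conv_output_size(100, 5, padding=2, stride=1),  # output height: 100
      conv_output_size(60, 5, padding=2, stride=1))   # output width: 60
```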

Writer: Piyush Kulkarni (Data Scientist)
