Let’s say you want to create a model that can tell whether the animal in an image is a cat or a dog. We immediately think of a convolutional neural network (CNN), but why can’t we use a regular neural network? They are pretty similar, after all…
They both make use of neurons as their basic functional unit and the weights and biases associated with each neuron are updated during the training.
But we don’t, right…..
The basic difference that you can observe is that, unlike a regular/dense neural network (DNN), where inputs take the form of a 1D array, the CNN architecture uses 3D arrays as its input, representing the height, width, and depth of an image. If we were to train a regular neural network on these images, we would first have to flatten them into one large 1D array.
Let’s take an easy example: the MNIST dataset. Each image in MNIST is 28×28 pixels with a single channel, since it is greyscale. In order to build a digit recognizer using a DNN, we would first reshape the 2D array of each image to a 1D array of 784 (=28×28) pixels. The resulting 1D array would look something like below.
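In NumPy, this flattening step is a single reshape (MNIST images are 28×28 greyscale, so flattening gives 784 values; a dummy array stands in for a real digit here):

```python
import numpy as np

# A dummy 28x28 greyscale "image" standing in for one MNIST digit
# (values are just a ramp; a real MNIST loader would supply actual pixels).
image = np.arange(28 * 28).reshape(28, 28)

# Flattening for a dense network: 28x28 -> a 784-element 1D array
flat = image.reshape(-1)

print(flat.shape)  # (784,)
```

Note that after this reshape, pixels that were vertical neighbours in the 2D grid end up 28 positions apart in the 1D array.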
If you notice carefully, the above transformation of the data creates 2 major issues in building an efficient DNN model:
- The resultant array is huge, even though the source image is tiny. For a regular image of 768×1024 pixels with a depth of 3 channels (RGB), the flattened array would have ~2.4 million (768×1024×3) elements. Imagine creating a DNN with that many neurons in its first layer just to handle the input!
- We lose the spatial information while reshaping the data. In the above example, we can see that values 5 and 12 are right next to each other. This information is lost in the 1D array.
What is a CNN?
Convolutional Neural Networks are also known as CNNs or ConvNets. They are used in tasks like image classification, object recognition, and face recognition, since they can identify and learn features directly from the image. A CNN can capture the spatial features in an image, which enables it to identify this:
as well as……
The basic flow of a CNN is as follows
The CNN architecture mainly consists of three layers: convolutional layer, pooling layer, and fully connected layer.
The convolutional layer is the main building block and contains a set of kernels, or filters. We get a feature map (the output) when these kernels are convolved with the input matrix. In a convolution, each element of the input patch is multiplied by its corresponding element in the filter (like a dot product) and the results are added together. This weighted sum is then passed through an activation function; the most popular is the rectified linear unit (ReLU), which filters the features that get passed on. The layer is used to detect features present in the image.
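The patch-by-patch weighted sum followed by ReLU can be sketched in a few lines of NumPy (the input and kernel values here are made up purely for illustration, stride 1 and no padding assumed):

```python
import numpy as np

def conv2d(x, k):
    """Convolve a 2D input x with a 2D kernel k (stride 1, no padding)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the patch with the kernel and sum
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    # negative responses are clipped to zero
    return np.maximum(x, 0)

x = np.array([[3., 1., 0.],
              [0., 1., 2.],
              [4., 0., 1.]])
k = np.array([[1., 0.],
              [0., -1.]])  # a toy 2x2 kernel

feature_map = relu(conv2d(x, k))
print(feature_map)  # 2x2 feature map; negative sums have been zeroed by ReLU
```

A 3×3 input convolved with a 2×2 kernel yields a 2×2 feature map, which is exactly what the output-size formula below predicts.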
Kernel Vs Filter
A kernel is a small matrix used in the convolution operation. It suppresses unwanted features in the image so that the desirable ones can be detected by the network; in other words, it extracts features from the image.
The basic distinction between a filter and a kernel is that a filter is a stack of kernels, one per input channel.
We can compute the size of the output obtained from the convolutional layer by using the following terms :
- Receptive field size (F): the size of the filter or kernel.
- Stride (S): the step size with which the filter moves over the input feature map.
- Zero padding (P): convolution shrinks the feature map, so we pad the input matrix with zeros around the border; this lets us control the size of the output feature map.
Now, given the size of the input feature map (W), the stride (S) with which the filter moves, the receptive field size (F), and the amount of zero padding (P), we can calculate the size of the output feature map as O = (W − F + 2P)/S + 1.
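The output-size formula translates directly into a small helper (this assumes W − F + 2P divides evenly by S; the example sizes are just common choices):

```python
def conv_output_size(W, F, S, P):
    """Output feature map size: O = (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# 28x28 input, 5x5 kernel, stride 1, no padding -> 24x24 output
print(conv_output_size(28, 5, 1, 0))  # 24

# 28x28 input, 3x3 kernel, stride 1, padding 1 -> output stays 28x28
print(conv_output_size(28, 3, 1, 1))  # 28
```

The second call shows how padding preserves the input size, which is why it is often called "same" padding.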
We usually obtain low-level features like edges and corners from the convolutional layers. The pooling layer merges similar features in order to obtain high-level or dominant features, and in doing so reduces the dimensionality of the matrix. In a typical pooling operation, the maximum value of the small patch of the feature map covered by the kernel is taken; we call this operation max pooling. There is another type of pooling called average pooling, which returns the average of all the values in the portion of the image covered by the kernel. Since max pooling increases the contrast around edges in an image, it is generally preferred for recognizing and classifying objects in an image.
Fully connected layer
The output of the Flatten layer is the input to this layer. The fully connected layer draws inferences from the feature maps obtained in the previous layers and produces a 1D vector as its output. This output is passed to a Softmax or Sigmoid activation function, which gives us the final class probabilities.
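The final flatten → fully connected → softmax stage can be sketched as below. The weights are random placeholders rather than trained parameters, and the shapes (eight 4×4 pooled maps, 10 output classes) are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pooled feature maps: eight 4x4 maps
feature_maps = rng.standard_normal((4, 4, 8))

# Flatten layer: collapse the 3D volume into a 1D vector of 128 values
flat = feature_maps.reshape(-1)

# Fully connected layer: one weight row per output class (10 classes here)
W = rng.standard_normal((10, flat.size)) * 0.01
b = np.zeros(10)
logits = W @ flat + b

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)
print(probs.sum())  # the class probabilities sum to 1
```

Softmax is the usual choice when the classes are mutually exclusive (cat vs. dog vs. other digits), since it turns the logits into a probability distribution over all classes at once.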
Some links in case you want to explore more about the topic: