ResNet (residual network) is a type of artificial neural network that is typically used for image recognition. It was proposed to make very deep neural networks, which are difficult to train, easier to optimize. The first problem with deeper networks was the vanishing/exploding gradients problem. Once this was largely addressed by normalized initialization and intermediate normalization layers, a degradation (of accuracy) problem became apparent: as depth increases, accuracy saturates and then degrades. This degradation is not caused by overfitting, since the deeper networks also show higher training error. To resolve this, Kaiming He et al. proposed a deep residual learning framework built on shortcut connections. A shortcut connection skips one or more layers.
The shortcuts are easiest to see when a residual network is drawn next to a plain neural network and a VGG-19 model. A residual network is made up of blocks, each bypassed by a shortcut, that are stacked upon each other.
Each unit can be expressed in a general form:
$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$
where x_l and x_{l+1} are the input and output of the l-th unit, W_l is the set of weights of that unit, F(x_l, W_l) represents the residual mapping to be learned, h(x_l) = x_l is an identity mapping (the output is the input, unchanged) and f is a ReLU function.
Understanding this intuitively
In ResNet, we bypass the layers in between and send the information from one layer to a deeper layer via the shortcut. The information is passed without being altered, so the shortcut acts as an identity mapping. In a regular deep network, which stacks many non-linear layers, the solver may have difficulty approximating the identity function. When the network has residual connections, it can simply drive the weights of the bypassed layers toward zero, so that the block as a whole approximates an identity mapping.
*F(x) does not have to be a single hidden layer; it can span any number of hidden layers.
But why is this called residual learning?
We have the input x and H(x), the underlying mapping that needs to be learned. He et al. define the residual function as the difference between the underlying mapping and the input, i.e. F(x) = H(x) - x. In general, a residual is the difference between an observed value and the estimate or prediction of that value.
This residual function will make learning easier since if we want identity mappings, we will just make F(x) = 0, which will give us H(x) = x, which is an identity mapping.
In summary, we approximate the residual function instead of the underlying mapping (the regular output), which makes learning the features easier.
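The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the residual branch F is assumed to be two small fully connected layers, and the names `matvec` and `residual_unit` are hypothetical helpers introduced here. With all weights at zero, F(x) = 0 and the unit returns its (non-negative) input unchanged.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_unit(x, W1, W2):
    """Compute f(h(x) + F(x)) with h = identity and f = ReLU.

    The residual branch is assumed to be F(x) = W2 * relu(W1 * x).
    """
    residual = matvec(W2, relu(matvec(W1, x)))          # F(x)
    return relu([a + r for a, r in zip(x, residual)])   # ReLU(x + F(x))


x = [0.5, 2.0, 0.0]  # non-negative, as if produced by an earlier ReLU
zeros = [[0.0] * 3 for _ in range(3)]

# With zero weights the residual vanishes and the unit is an identity mapping.
print(residual_unit(x, zeros, zeros))  # [0.5, 2.0, 0.0]
```

Note that the identity only holds exactly here because the input is non-negative; the final ReLU would clip any negative entries.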
Some popular variants of ResNet
ResNeXt introduces cardinality, a dimension that is not present in ResNet. Cardinality is the number of parallel paths inside a block and is an integral part of ResNeXt. These parallel paths all share the same topology, and their outputs are aggregated (summed), so capacity can be added without changing the overall design of the network. This differs from the Inception network, where the parallel transformations have different topologies.
Experiments have shown that, when increasing model capacity, increasing cardinality is a more effective way of gaining accuracy than going deeper or wider (Xie et al., 2016, Aggregated Residual Transformations for Deep Neural Networks).
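The aggregation can be sketched in plain Python as y = x + T_1(x) + ... + T_C(x), where C is the cardinality. This is an illustrative sketch under simplifying assumptions: each path T_i is taken to be a tiny "project down, ReLU, project up" transformation, the final ReLU of the real block is omitted, and the names `path` and `resnext_block` are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def path(x, W_down, W_up):
    """One transformation T_i: project down, apply ReLU, project back up."""
    return matvec(W_up, relu(matvec(W_down, x)))


def resnext_block(x, paths):
    """Return x + sum_i T_i(x); len(paths) is the cardinality."""
    total = list(x)  # start from the shortcut (identity) term
    for W_down, W_up in paths:
        t = path(x, W_down, W_up)
        total = [a + b for a, b in zip(total, t)]
    return total


x = [1.0, 2.0]
# Two paths with the same topology (2 -> 1 -> 2) but different weights.
paths = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
]
print(resnext_block(x, paths))  # [2.0, 4.0]
```

Raising the cardinality here only means appending more (W_down, W_up) pairs of the same shape; the block's structure stays the same.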
ResNetV2 has a different residual unit than ResNet. The main mathematical difference is that f in the formula above is an identity mapping rather than a ReLU. Since both h(x_l) and f(y_l) are identity mappings, information can be propagated directly from any unit to any other unit, in both the forward and backward directions. Because of the identity mappings, this path is kept 'clean', which makes optimization easier and also gives better results.
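Why identity mappings help can be made explicit with a short derivation, following the argument in He et al. (2016). With h and f both identity, each unit computes x_{l+1} = x_l + F(x_l, W_l). The symbol E for the loss and the index L for a deeper unit are notation introduced here for illustration.

```latex
\begin{aligned}
x_{l+1} &= x_l + F(x_l, W_l) \\
x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \\
\frac{\partial E}{\partial x_l}
  &= \frac{\partial E}{\partial x_L}
     \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
\end{aligned}
```

The constant 1 in the last line means that part of the gradient at the deeper unit L reaches unit l directly, without passing through any weight layers.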
In simpler terms, in the original ResNet unit (a) the ReLU sits where the shortcut joins the residual branch, after the addition. In the ResNetV2 unit (b) the ReLU is moved into the residual branch, before the weight layers, so nothing is applied after the addition. This makes propagation through the network easier, since both positive and negative values pass through the addition unchanged. With a ReLU after the addition, only zero or positive values would be passed on to the next unit, which would reduce the learning capacity of the network.
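The difference can be seen on a toy example in plain Python. This is a sketch, not either paper's code: the residual branch output is a hard-coded, hypothetical vector, and `original_unit` and `preact_unit` are names introduced here.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def original_unit(x, residual):
    """Original ResNet: x_{l+1} = ReLU(x_l + F), ReLU after the addition."""
    return relu([a + r for a, r in zip(x, residual)])


def preact_unit(x, residual):
    """ResNetV2: x_{l+1} = x_l + F, nothing applied after the addition."""
    return [a + r for a, r in zip(x, residual)]


x = [1.0, 0.5]
residual = [-2.0, 0.25]  # hypothetical output F of the residual branch

print(original_unit(x, residual))  # [0.0, 0.75]  negative sum is clipped
print(preact_unit(x, residual))    # [-1.0, 0.75] negative sum is kept
```

In the second case the value -1.0 reaches the next unit intact, which is what keeps the shortcut path 'clean'.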
Implementation of ResNet in Keras
Keras provides several ResNet models (same concept, different numbers of layers) pretrained on the ImageNet dataset.
from tensorflow.keras.applications.resnet50 import ResNet50
model = ResNet50(weights='imagenet')
For ResNet-101 or ResNet-152, import ResNet101 or ResNet152 from tensorflow.keras.applications.resnet instead, since the resnet50 module only exposes ResNet50.
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
model = ResNet50V2(weights='imagenet')
Replace 50 with 101 or 152 for ResNet-101V2 or ResNet-152V2 respectively.
!pip install git+https://github.com/qubvel/classification_models.git
import keras
from classification_models.keras import Classifiers
ResNeXt50, preprocess_input = Classifiers.get('resnext50')
model = ResNeXt50(include_top=False, input_shape=(224, 224, 3), weights='imagenet')
Replace 50 with 101 for ResNeXt-101.
If you want to use other models such as ResNet[18, 34], SeResNet[18, 34, 50, 101, 152] or SeResNeXt[50, 101], you can refer to the same GitHub repository (qubvel/classification_models).
Accuracy of ResNet and Variants Compared with Other Neural Networks
You can refer to the bubble chart above to determine which ResNet model to use based on accuracy as well as the number of trainable parameters. The bubble sizes correspond to the approximate number of trainable parameters in each network. When choosing a ResNet architecture, keep in mind the data that you have. Both extremes can hurt accuracy: a very deep network with many trainable parameters may overfit a small dataset, while a very shallow network with very few trainable parameters may underfit a large one.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun (2016). Identity Mappings in Deep Residual Networks. CoRR, abs/1603.05027.
Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, & Kaiming He (2016). Aggregated Residual Transformations for Deep Neural Networks. CoRR, abs/1611.05431.
Simone Bianco, Rémi Cadène, Luigi Celona, & Paolo Napoletano (2018). Benchmark Analysis of Representative Deep Neural Network Architectures. IEEE Access, 6, 64270-64277. doi:10.1109/ACCESS.2018.2877890.