Cost Functions
When you think about cost, what comes to your mind?

Cost is the price we estimate we will have to pay for a service. Likewise, a cost function estimates the price a model pays, in terms of accuracy, for the predictions it makes.
A cost (or loss) function is a measure of the error between the actual value and the predicted value. It is crucial to a machine learning model because it provides the feedback on the model's performance.
There are multiple cost functions used in regression and classification tasks; they are discussed in the post below.
Cost functions for regression tasks
A regressor predicts a continuous variable using a function modeled on historical data. You can read about regressors over here.
Mean Absolute Error
Mean absolute error (MAE) is also known as L1 loss. If y is your actual value and y' is your predicted value, the MAE over n samples is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - y'_i|$$
The sklearn.metrics module provides a mean_absolute_error function. Let us see the contribution of each error to the total error using the code below.
from numpy import asarray
from sklearn.metrics import mean_absolute_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.4, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_absolute_error(actual, predicted)
#contribution of each pair to the total error
for i in range(len(actual)):
    s = abs(actual[i] - predicted[i])
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean absolute Error: %.3f' % error)
Contribution to error by 1.200 and 0.800 is 0.400
Contribution to error by 1.700 and 1.900 is 0.200
Contribution to error by 1.000 and 0.900 is 0.100
Contribution to error by 0.700 and 1.400 is 0.700
Contribution to error by 1.000 and 0.800 is 0.200
Contribution to error by 0.200 and 0.100 is 0.100
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean absolute Error: 0.170
This error weights every deviation from the true value in direct proportion to its magnitude, with no extra emphasis on large deviations, as you can see above: the large deviation between 1.2 and 0.8 contributes 0.4, while the small deviation between 0.2 and 0.1 contributes only 0.1. In short, each contribution is proportional to the deviation from the actual value.
With an outlier in the actual data…
from numpy import asarray
from sklearn.metrics import mean_absolute_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 5, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.4, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_absolute_error(actual, predicted)
for i in range(len(actual)):
    s = abs(actual[i] - predicted[i])
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean absolute Error: %.3f' % error)
Contribution to error by 1.200 and 0.800 is 0.400
Contribution to error by 1.700 and 1.900 is 0.200
Contribution to error by 1.000 and 0.900 is 0.100
Contribution to error by 0.700 and 1.400 is 0.700
Contribution to error by 5.000 and 0.800 is 4.200
Contribution to error by 0.200 and 0.100 is 0.100
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean absolute Error: 0.570
MAE is robust to outliers (see our post about outliers): the example above shows that a single outlier does not inflate the final error to a large extent.

Mean Squared Error
Mean squared error (MSE) is also known as L2 loss. If y is your actual value and y' is your predicted value, it is calculated as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2$$

MSE averages the squares of the differences between the actual and the predicted values. Let us see an example that will allow us to understand it properly.
The simplest way to calculate the mean squared error is to use Scikit-Learn (sklearn). The metrics module comes with a function mean_squared_error(), which allows you to pass in true and predicted values. Let's see how to calculate the MSE with sklearn:
from numpy import asarray
from sklearn.metrics import mean_squared_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Squared Error: %.3f' % error)
Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 0.700 and 0.600 is 0.010
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Squared Error: 0.027
Thus, the errors do not contribute equally to the loss: a prediction farther from the actual value has more impact (due to the squaring) than one near the actual value.
With an outlier in the actual data…
from numpy import asarray
from sklearn.metrics import mean_squared_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 4.6, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.7, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Squared Error: %.3f' % error)
Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 4.600 and 0.700 is 15.210
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Squared Error: 1.547
As you can see, the error with an outlier is far greater than the error without one: the squaring amplifies the outlier's contribution significantly. Since a single outlier can increase the final error dramatically, MSE is not very robust to outliers.

(Figure: the area of the squares is the contribution of that pair of values to the total error.)
Root Mean Square Error
Root mean square error (RMSE) is obtained by taking the square root of the MSE. Researchers recommend MAE over RMSE because in RMSE each error does not influence the final value proportionally [1]. RMSE is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2}$$

Let us see the contribution of each error to the total error using the code below.
from numpy import asarray
import math
from sklearn.metrics import mean_squared_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Root Mean Squared Error: %.3f' % math.sqrt(error))
Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 0.700 and 1.600 is 0.810
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Root Mean Squared Error: 0.327
With an outlier in the actual data…
from numpy import asarray
import math
from sklearn.metrics import mean_squared_error
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 4.6, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.7, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Root Mean Squared Error: %.3f' % math.sqrt(error))
Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 4.600 and 0.700 is 15.210
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Root Mean Squared Error: 1.244
As you can see, the error with an outlier is far greater than the error without one; like MSE, RMSE amplifies an outlier's contribution significantly, so it is not very robust to outliers.
Huber Loss
Huber loss is used for regression tasks. It is less sensitive to outliers because it squares the error only within an interval defined by the parameter delta and grows linearly outside it. It is thus a combination of L1 and L2 loss and gives us the best of both worlds:

$$L_\delta(y, y') = \begin{cases} \frac{1}{2}(y - y')^2 & \text{for } |y - y'| \le \delta \\ \delta\left(|y - y'| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
Let us see how to calculate Huber loss with TensorFlow's implementation in the code below:
from numpy import asarray
import tensorflow as tf
#array of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = tf.keras.losses.huber(actual, predicted)
print('Huber Loss: %.3f' % error)
Huber Loss: 0.453
With an outlier in the actual data…
from numpy import asarray
import tensorflow as tf
#array of actual and predicted values
actual = asarray([1.2, 1.7, 5, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
error = tf.keras.losses.huber(actual, predicted)
print('Huber Loss: %.3f' % error)
Huber Loss: 0.812
As you can observe, the outlier does not have a significant impact on the loss compared to RMSE or MSE.
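To see the piecewise definition in action, here is a minimal NumPy sketch of Huber loss (the helper name huber_numpy is ours; delta defaults to 1.0 to match tf.keras.losses.huber):
import numpy as np
def huber_numpy(actual, predicted, delta=1.0):
    # absolute error of each pair of values
    error = np.abs(actual - predicted)
    # quadratic (L2-like) branch for small errors, linear (L1-like) branch for large ones
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))
actual = np.asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = np.asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
print('Huber Loss: %.3f' % huber_numpy(actual, predicted))
Huber Loss: 0.453
This matches the TensorFlow result above.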
When to use these?

Here are a few points to know before you choose any metric:
- RMSE is always greater than or equal to MAE (the sketch after this list shows this on our earlier data)
- RMSE is not a reliable measure of ‘average error’ and should not be used to compare the average performance of 2 models[1]
- Use RMSE over MAE when the errors are normally distributed
- RMSEs are preferred for data assimilation applications and while calculating the model’s error sensitivities[2]
- MAE is robust to outliers
- Use MSE when you want outliers to weigh heavily on the loss, and Huber when you want to limit their influence selectively
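To make these points concrete, the sketch below computes MAE, MSE, and RMSE side by side on the same data used in the MAE examples above; in both cases RMSE comes out at least as large as MAE:
from numpy import asarray, sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error
predicted = asarray([0.8, 1.9, 0.9, 1.4, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])
# the same actual values, without and with the outlier of 5
for label, actual in [('no outlier', asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])),
                      ('outlier', asarray([1.2, 1.7, 1, 0.7, 5, 0.2, 0.4, 0.2, 0.1, 0.3]))]:
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    print('%s: MAE=%.3f, MSE=%.3f, RMSE=%.3f' % (label, mae, mse, sqrt(mse)))
no outlier: MAE=0.170, MSE=0.075, RMSE=0.274
outlier: MAE=0.570, MSE=1.835, RMSE=1.355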
Cost functions for Classification tasks
Say you have an image, or some data about the physical characteristics of an animal, and you want to classify it into one of 3 classes: cat, dog, or mouse. For this you use classifiers.
In classification problems, your model gives you the probability of the input belonging to each class.
Cross entropy
Cross entropy is a measure of loss used in classification tasks. Since the model usually outputs a probability, if the correct class is dog (so the desired probability is 1) but the model predicts 0.2, it must be penalized more than if it had predicted, say, 0.65. The drastically wrong prediction of 0.2 is thus penalized more heavily than a prediction of 0.65.

For a label whose ideal value is 1, the loss increases dramatically as the predicted probability deviates from the desired value of 1.
Cross entropy over only two classes is called binary cross-entropy. If y is the binary indicator and y' is your predicted probability, the loss is calculated as

$$L = -\big(y \log(y') + (1 - y)\log(1 - y')\big)$$
Let’s see an example that will demonstrate how binary cross-entropy is calculated. We have two images for which our classifier predicts if it has a dog or a cat in it( 2 classes).


Binary cross-entropy for Image 1 = -[1*log(0.3) + (1-1)*log(0.7)] = 0.52
Binary cross-entropy for Image 2 = -[0*log(0.3) + (1-0)*log(0.7)] = 0.15
(Both use base-10 logarithms.) You can observe that the model is penalized more the further it deviates from the correct label.
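A minimal NumPy sketch of this calculation (the helper name binary_cross_entropy is ours; we use base-10 logarithms to match the numbers above, whereas sklearn's log_loss uses natural logarithms):
import numpy as np
def binary_cross_entropy(y, p):
    # y is the true binary indicator, p is the predicted probability of the positive class
    return -(y * np.log10(p) + (1 - y) * np.log10(1 - p))
# Image 1 is a dog (y=1), Image 2 is a cat (y=0); P(dog)=0.3 for both
print('Image 1: %.2f' % binary_cross_entropy(1, 0.3))
print('Image 2: %.2f' % binary_cross_entropy(0, 0.3))
Image 1: 0.52
Image 2: 0.15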
Multi-class Cross Entropy Loss
When we have multiple classes (more than 2), we calculate the loss for each class separately and sum the results. Multi-class means you have an image and you want to classify it as a dog, a cat, or a mouse; the image can belong to only one class. Mathematically, it is represented as follows:

$$L = -\sum_{c=1}^{M} y_c \log(p_c)$$

where
p_c is the predicted probability of class c,
y_c is the binary indicator (0 or 1) of whether class c is the correct class, and
M is the total number of classes.
Let us see an example: we have 3 images, each to be classified as a cat, a dog, or a mouse. Of the probabilities obtained for each image, the one predicted for the correct class is 0.7 for Image 1, 0.5 for Image 2, and 0.3 for Image 3 (the table of per-class probabilities is omitted here).
We calculate the cross entropy as CE = -log(0.7) - log(0.5) - log(0.3) = 0.98, again with base-10 logarithms.
As you can observe, each image's loss is added to the final loss.
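The same calculation in NumPy (again with base-10 logarithms to match the numbers above; with one-hot labels, only the probability predicted for the correct class contributes):
import numpy as np
# predicted probability of the correct class for each of the three images
p_correct = np.asarray([0.7, 0.5, 0.3])
# with one-hot indicators y, -sum(y*log(p)) reduces to -log of the correct class's probability
ce = -np.log10(p_correct).sum()
print('Cross entropy: %.2f' % ce)
Cross entropy: 0.98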
Hinge Loss
Hinge loss is mostly used in Support Vector Machines (see our post about SVMs). It penalizes the model according to the prediction's distance from the classification boundary established by the SVM. It is calculated as follows:

$$L = \max(0,\ 1 - t \cdot y)$$

where
t is the binary indicator of the class and can be -1 (negative) or +1 (positive), and
y is the raw prediction output by the SVM.
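A minimal NumPy sketch of hinge loss under this definition, with made-up labels and SVM scores:
import numpy as np
def hinge_loss(t, y):
    # t holds the true labels in {-1, +1}; y holds the raw SVM output scores
    return np.mean(np.maximum(0, 1 - t * y))
t = np.asarray([1, -1, 1, -1])
y = np.asarray([0.8, -1.2, -0.3, 0.4])  # hypothetical SVM outputs
print('Hinge Loss: %.3f' % hinge_loss(t, y))
Hinge Loss: 0.725
Correctly classified points far from the boundary (the first two) contribute little or nothing, while misclassified points (the last two) contribute more than 1.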
References
[1] Willmott, C. J., & Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1), 79–82. http://www.jstor.org/stable/24869236
[2] Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Geoscientific Model Development, 7. https://doi.org/10.5194/gmdd-7-1525-2014