Cost Functions

When you think about cost, what comes to your mind?

In everyday life, cost is the price we pay for a service. Likewise, a cost function measures the price a model pays for the inaccuracy of its predictions.

A cost (or loss) function is a measure of the error between the actual value and the predicted value. It is crucial to a machine learning model since it provides the feedback on the model's performance.

There are multiple cost functions used in regression and classification tasks; they are discussed below.

Cost functions for regression tasks

A regressor predicts a continuous variable based on a function modeled on historical data (see our post about regressors).

Mean Absolute Error

It is also known as the L1 loss. If y_i is the actual value and y'_i the predicted value for each of the n samples, the mean absolute error (MAE) is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - y'_i|$$

The sklearn.metrics module provides a mean_absolute_error function. Let us see the contribution of each error to the total error using the code below.

from numpy import asarray
from sklearn.metrics import mean_absolute_error

#arrays of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.4, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_absolute_error(actual, predicted)
#absolute contribution of each pair to the total error
for i in range(len(actual)):
    s = abs(actual[i] - predicted[i])
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Absolute Error: %.3f' % error)

Contribution to error by 1.200 and 0.800 is 0.400
Contribution to error by 1.700 and 1.900 is 0.200
Contribution to error by 1.000 and 0.900 is 0.100
Contribution to error by 0.700 and 1.400 is 0.700
Contribution to error by 1.000 and 0.800 is 0.200
Contribution to error by 0.200 and 0.100 is 0.100
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Absolute Error: 0.170

As you can see above, MAE weights every deviation from the true value in proportion to its magnitude. The error between 1.2 and 0.8 is large, so its contribution is 0.4, while the error between 0.2 and 0.1 is small, so its contribution is 0.1. In other words, each contribution is directly proportional to the deviation from the actual value.

With an outlier in the actual data…

from numpy import asarray
from sklearn.metrics import mean_absolute_error

#arrays of actual and predicted values, with an outlier (5) in the actual data
actual = asarray([1.2, 1.7, 1, 0.7, 5, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.4, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_absolute_error(actual, predicted)
for i in range(len(actual)):
    s = abs(actual[i] - predicted[i])
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Absolute Error: %.3f' % error)

Contribution to error by 1.200 and 0.800 is 0.400
Contribution to error by 1.700 and 1.900 is 0.200
Contribution to error by 1.000 and 0.900 is 0.100
Contribution to error by 0.700 and 1.400 is 0.700
Contribution to error by 5.000 and 0.800 is 4.200
Contribution to error by 0.200 and 0.100 is 0.100
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Absolute Error: 0.570


MAE is robust to outliers (see our post about outliers): the outlier's contribution (4.2 here) grows only linearly with its deviation, so a single extreme value raises the mean but cannot dominate the final error the way a squared penalty would, as the next sections show.

Visual Representation of MAE (Inspiration: Data Analytics)

Mean Squared Error

If y_i is the actual value and y'_i the predicted value,

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2$$

Mean Squared Error (MSE) is also known as the L2 loss. MSE averages the squares of the differences between the actual and predicted values. Let us look at an example to understand it properly.

The simplest way to calculate the mean squared error is with Scikit-Learn (sklearn). The metrics module provides the function mean_squared_error(), which accepts the true and predicted values. Let's see how to calculate the MSE with sklearn:

from numpy import asarray
from sklearn.metrics import mean_squared_error

#arrays of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_squared_error(actual, predicted)
#squared contribution of each pair to the total error
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Squared Error: %.3f' % error)

Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 0.700 and 0.600 is 0.010
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Squared Error: 0.027


Thus, the errors do not contribute equally to the loss: a prediction farther from the actual value has more impact (due to the squaring) than one near the actual value.

With an outlier in the actual data…

from numpy import asarray
from sklearn.metrics import mean_squared_error

#arrays of actual and predicted values, with an outlier (4.6) in the actual data
actual = asarray([1.2, 1.7, 1, 4.6, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.7, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Mean Squared Error: %.3f' % error)

Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 4.600 and 0.700 is 15.210
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Mean Squared Error: 1.547


As you can see, the error with an outlier is far greater than without one: squaring amplifies the outlier's contribution (15.21 out of a total of 15.47). Since a single outlier can dominate the final error, MSE is not robust to outliers.

Visual Representation of MSE (Inspiration: Data Analytics)

The area of the squares is the contribution of that pair of values to the total error.

Root Mean Square Error

It is obtained by taking the square root of the MSE. Some researchers recommend using MAE instead of the root mean square error (RMSE), since individual errors do not influence the final RMSE value proportionally [1]. It is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2}$$

Let us see the contribution of each error to the total error using the code below.

from numpy import asarray
import math
from sklearn.metrics import mean_squared_error

#arrays of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 1.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_squared_error(actual, predicted)
#squared contribution of each pair before the final square root
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Root Mean Squared Error: %.3f' % math.sqrt(error))

Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 0.700 and 1.600 is 0.810
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Root Mean Squared Error: 0.327

With an outlier in the actual data…

from numpy import asarray
import math
from sklearn.metrics import mean_squared_error

#arrays of actual and predicted values, with an outlier (4.6) in the actual data
actual = asarray([1.2, 1.7, 1, 4.6, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 0.7, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = mean_squared_error(actual, predicted)
for i in range(len(actual)):
    s = (actual[i] - predicted[i]) ** 2
    print('Contribution to error by %.3f and %.3f is %.3f' % (actual[i], predicted[i], s))
print('Root Mean Squared Error: %.3f' % math.sqrt(error))

Contribution to error by 1.200 and 0.800 is 0.160
Contribution to error by 1.700 and 1.900 is 0.040
Contribution to error by 1.000 and 0.900 is 0.010
Contribution to error by 4.600 and 0.700 is 15.210
Contribution to error by 1.000 and 0.800 is 0.040
Contribution to error by 0.200 and 0.100 is 0.010
Contribution to error by 0.400 and 0.400 is 0.000
Contribution to error by 0.200 and 0.200 is 0.000
Contribution to error by 0.100 and 0.100 is 0.000
Contribution to error by 0.300 and 0.300 is 0.000
Root Mean Squared Error: 1.244


As you can see, the error with an outlier is again far greater than without one; the outlier's squared contribution dominates the result even after the square root. Like MSE, RMSE is therefore not very robust to outliers.

Huber Loss

Huber loss is used for regression tasks. It is less sensitive to outliers in the data since it squares the error only within a certain interval defined by a parameter delta; outside that interval, the penalty grows linearly. Thus, it is a combination of the L1 and L2 losses and gives us the best of both worlds.

$$L_\delta(y, y') = \begin{cases} \frac{1}{2}(y - y')^2 & \text{for } |y - y'| \le \delta \\ \delta\,|y - y'| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Let us see how to calculate the Huber loss using the TensorFlow implementation:

from numpy import asarray
import tensorflow as tf

#arrays of actual and predicted values
actual = asarray([1.2, 1.7, 1, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

#tf.keras.losses.huber uses delta=1.0 by default
error = tf.keras.losses.huber(actual, predicted)
print('Huber Loss: %.3f' % error)

Huber Loss: 0.453

With an outlier in the actual data…

from numpy import asarray
import tensorflow as tf

#arrays of actual and predicted values, with an outlier (5) in the actual data
actual = asarray([1.2, 1.7, 5, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

error = tf.keras.losses.huber(actual, predicted)
print('Huber Loss: %.3f' % error)

Huber Loss: 0.812

As you can observe, the outlier doesn't have a significant impact on the loss compared to RMSE or MSE.
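
To see the per-error breakdown the way we did for MAE and MSE, here is a minimal NumPy sketch of the same calculation, assuming the default delta of 1.0 used by tf.keras.losses.huber:

import numpy as np

#same arrays as the outlier example above
actual = np.asarray([1.2, 1.7, 5, 0.7, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = np.asarray([0.8, 1.9, 0.9, 5.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

delta = 1.0  #threshold between the quadratic and linear regimes
err = np.abs(actual - predicted)
#0.5*e^2 inside the delta interval, a linear penalty outside it
contrib = np.where(err <= delta, 0.5 * err**2, delta * (err - 0.5 * delta))
for a, p, s in zip(actual, predicted, contrib):
    print('Contribution to error by %.3f and %.3f is %.3f' % (a, p, s))
print('Huber Loss: %.3f' % contrib.mean())  #0.812, matching the TensorFlow result

Notice that the two large errors (4.1 and 4.9) contribute only linearly, which is exactly why the loss stays small.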


When to use these?

Comparison of MSE, Huber, and MAE

Here are a few points to know before you choose any metric:

  • RMSE is always greater than or equal to MAE
  • RMSE is not a reliable measure of 'average error' and should not be used to compare the average performance of two models [1]
  • Use RMSE over MAE when the error distribution is normal
  • RMSEs are preferred for data assimilation applications and when calculating a model's error sensitivities [2]
  • MAE is robust to outliers
  • Use MSE when you want outliers to weigh heavily on the loss and Huber when you want to give them selective importance, as the sketch below shows
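
To make the last point concrete, here is a small sketch that reuses the outlier arrays from the MSE example and computes all three losses side by side (Huber computed manually with delta = 1.0, matching the tf.keras default):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

#outlier arrays from the MSE example above
actual = np.asarray([1.2, 1.7, 1, 4.6, 1, 0.2, 0.4, 0.2, 0.1, 0.3])
predicted = np.asarray([0.8, 1.9, 0.9, 0.7, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3])

err = np.abs(actual - predicted)
delta = 1.0
#Huber: quadratic for small errors, linear for the outlier
huber = np.where(err <= delta, 0.5 * err**2, delta * (err - 0.5 * delta)).mean()

print('MAE:   %.3f' % mean_absolute_error(actual, predicted))  #outlier weighted linearly
print('MSE:   %.3f' % mean_squared_error(actual, predicted))   #outlier's squared term dominates
print('Huber: %.3f' % huber)                                   #outlier capped to a linear penalty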

Cost functions for Classification tasks

Say you have an image, or some data about the physical characteristics of an animal, and you want to classify it into one of 3 classes: cat, dog, or mouse. For this, you use classifiers.

In classification problems, your model outputs the probability of the input belonging to each class.

Cross entropy

Cross entropy is a measure of loss used in classification tasks, where the output is usually a probability. If the correct class is dog, the ideal predicted probability for that class is 1. A prediction of 0.2, which is drastically wrong, must therefore be penalized more heavily than a prediction of, say, 0.65.

Cross entropy loss vs probability

The graph above shows this loss for a label whose ideal value is 1. As you can see, the loss increases dramatically as the predicted probability deviates from the desired value of 1.

Cross entropy for only two classes is called binary cross-entropy. If y is the binary indicator and y’ is your predicted probability, the loss is calculated as

$$\mathrm{BCE} = -\big[\,y \log(y') + (1 - y)\log(1 - y')\,\big]$$

Let's see an example that demonstrates how binary cross-entropy is calculated. We have two images, and our classifier predicts whether each contains a dog or a cat (2 classes).

Actual classes and their predicted probabilities

Binary cross entropy for Image 1 = −[1·log(0.3) + (1−1)·log(0.7)] = 0.52
Binary cross entropy for Image 2 = −[0·log(0.3) + (1−0)·log(0.7)] = 0.15

(Both use base-10 logarithms.)

You can observe that the model is penalized more if it deviates from the correct label.
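
Here is a minimal sketch reproducing the arithmetic above. Note that the worked example uses base-10 logarithms; most libraries use the natural logarithm, which changes the scale of the loss but not the ordering of the penalties:

import numpy as np

def binary_cross_entropy(y, p):
    #y: binary indicator of the correct class, p: predicted probability of class 1 (dog)
    return -(y * np.log10(p) + (1 - y) * np.log10(1 - p))

#Image 1: correct class is dog (y=1), predicted probability of dog is 0.3
print('Image 1: %.2f' % binary_cross_entropy(1, 0.3))  #0.52
#Image 2: correct class is cat (y=0), predicted probability of dog is 0.3
print('Image 2: %.2f' % binary_cross_entropy(0, 0.3))  #0.15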

Multi-class Cross Entropy Loss

When we have multiple classes (more than 2), we calculate the loss for each class separately and sum the losses obtained. Multi-class means you have an image and want to classify it as a dog, cat, or mouse; the image can belong to only one class. Mathematically, it is represented as follows:

$$\mathrm{CE} = -\sum_{c=1}^{M} y_c \log(p_c)$$

where
p_c is the predicted probability of class c,
y_c is the binary indicator (0 or 1) of whether c is the correct class, and
M is the total number of classes.

Let us see an example: we have 3 images, and each has to be classified as a cat, a dog, or a mouse.

These are the probabilities obtained for each class for each image:

Actual classes and their predicted probabilities

We calculate the cross entropy as CE = −log(0.7) − log(0.5) − log(0.3) = 0.98, where 0.7, 0.5, and 0.3 are the probabilities assigned to the correct class of each image (base-10 logs again).

As you can observe, the loss from each image's correct class is added to the final loss.
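
A minimal sketch of the same calculation, again with base-10 logs to match the worked numbers:

import numpy as np

#probability the classifier assigned to the correct class of each image
#(0.7, 0.5, and 0.3, as in the table above)
correct_class_probs = np.asarray([0.7, 0.5, 0.3])

#the one-hot indicator y_c zeroes out every term except the correct class,
#so the sum reduces to -log of the correct-class probability per image
ce = -np.log10(correct_class_probs).sum()
print('Cross entropy: %.2f' % ce)  #0.98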

Hinge Loss

Hinge loss is mostly used in Support Vector Machines (see our post about SVMs). It penalizes the model depending on the distance from the classification boundary established by the SVM as shown in the graph below.

Image from Stackexchange

It is calculated as follows:

$$\ell(y) = \max(0,\; 1 - t \cdot y)$$

where
t is the binary class indicator, either −1 (negative class) or +1 (positive class), and
y is the raw prediction output by the SVM.
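
Here is a minimal NumPy sketch with made-up labels and decision values, showing that confidently correct predictions (margin of at least 1) incur zero loss while margin violations are penalized linearly:

import numpy as np

#hypothetical labels t in {-1, +1} and raw SVM decision values y
t = np.asarray([1, 1, -1, -1, 1])
y = np.asarray([2.3, 0.4, -1.7, 0.6, -0.5])

#zero loss for correct predictions beyond the margin,
#growing linearly for violations and misclassifications
loss = np.maximum(0, 1 - t * y)
print(loss)                                    #[0.  0.6 0.  1.6 1.5]
print('Mean hinge loss: %.2f' % loss.mean())   #0.74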

References

[1] Willmott, C. J., & Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1), 79–82. http://www.jstor.org/stable/24869236

[2] Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Geoscientific Model Development, 7. doi:10.5194/gmdd-7-1525-2014
