In machine learning, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a dataset at hand.
In this post, we’ll discuss the concepts of underfitting and overfitting, and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will generalize accurately to new, unseen data.
Before building a machine learning model, we first split our data into two parts: one for training the model (the “Training Data”) and another for monitoring the model’s accuracy on unseen data (the “Testing Data”). If the model is not accurate, it makes prediction errors, and these errors are usually described in terms of bias and variance. Some error will always be present, because there is always a difference between the model’s predictions and the actual values; our main aim is to reduce these errors so that the model makes better predictions.
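The split described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `split_train_test`, the 80/20 ratio, and the toy data are assumptions made for the example, not something prescribed by this post.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten labelled examples
train, test = split_train_test(data, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before splitting matters when the rows are stored in some meaningful order; otherwise the testing data may not resemble the training data.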
Now let’s go through the following one by one.
“Bias occurs when an algorithm has limited flexibility to learn the true signal from the dataset. i.e. the results are favored towards a specific outcome.” In other words, when the model makes predictions, there is a difference between the predicted values and the actual (expected) values, and this difference is known as bias error, or error due to bias. Every algorithm begins with some amount of bias, because bias arises from the assumptions a model makes to simplify learning the target function.
As a rule of thumb, the overall error on the “Training Data” is used as an indicator of bias.
A model has either:
- High Bias: The model makes more assumptions about the target function. This shows up as high training error (low training accuracy): the model is unable to capture the important features of the dataset, and it also cannot perform well on new data.
- Low Bias: The model makes fewer assumptions about the target function. This shows up as low training error (high training accuracy).
Since bias reflects the inability of a machine learning model to capture the true relationship in the data, a straight line fitted to curved data has high bias, while a squiggly line that bends to follow the data has low bias.
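To make the straight line versus squiggly line comparison concrete, here is a small sketch that fits both to the same noisy, curved data and compares their training error. The sine-shaped signal, the noise level, and the choice of degree 7 for the "squiggly" curve are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
# Synthetic data: a curved signal plus a little noise (assumed for the demo).
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.size)

def training_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line_mse = training_mse(1)    # straight line: few degrees of freedom
curve_mse = training_mse(7)   # flexible curve: can bend with the data
print(f"straight line: {line_mse:.4f}, degree-7 curve: {curve_mse:.4f}")
```

The straight line leaves a much larger training error because it cannot bend to follow the curve, which is what high bias looks like in practice.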
Examples of low bias machine learning algorithms: Decision Trees, k-Nearest Neighbors, and Support Vector Machines.
Examples of high bias machine learning algorithms: Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
Generally, linear algorithms have high bias, which is also what makes them fast to learn. The simpler the algorithm, the more bias it is likely to introduce, whereas nonlinear algorithms often have low bias.
We can reduce high bias through the following steps:
- Increase the number of input features, since the model is underfitting.
- Decrease the alpha parameter of regularization: https://medium.datadriveninvestor.com/the-art-of-regularization-caca8de7614e
- Use a more complex model, for example by adding polynomial features.
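The effect of these steps on training error can be sketched with a closed-form ridge regression. This is an illustration under stated assumptions: the helper names `polynomial_features` and `ridge_training_mse`, the cubic toy signal, and the particular alpha values are all made up for the example.

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns x, x^2, ..., x^degree (no constant column)."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def ridge_training_mse(features, y, alpha):
    """Fit ridge regression in closed form and return the training MSE."""
    # Centre the data so that the intercept is not penalised.
    fc = features - features.mean(axis=0)
    yc = y - y.mean()
    n_cols = fc.shape[1]
    weights = np.linalg.solve(fc.T @ fc + alpha * np.eye(n_cols), fc.T @ yc)
    return float(np.mean((fc @ weights - yc) ** 2))

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = x ** 3 - 0.5 * x + rng.normal(scale=0.05, size=x.size)  # assumed toy signal

for degree in (1, 3):
    for alpha in (10.0, 0.01):
        mse = ridge_training_mse(polynomial_features(x, degree), y, alpha)
        print(f"degree={degree} alpha={alpha:<5} training MSE={mse:.4f}")
```

In this setup, both adding the polynomial features and lowering alpha reduce the training error, which is the symptom we use to judge bias.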
When a model does not perform as well on new data as it does on the data set it was trained on, there is a possibility that the model has high variance. “Variance is a measure of the degree of the spread of the predicted results.” Variance indicates how much the model’s predictions would change if it were trained on different training data. In simpler terms, variance indicates how much a random variable differs from its expected value. Ideally, a model should not change too much from one training dataset to another, which means the algorithm should be good at picking up the underlying mapping between input and output variables.
As a rule of thumb, the overall error on the testing data is used as an indicator of variance.
A model has either high variance or low variance:
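The idea of predictions spreading out across different training sets can be simulated directly. In this sketch we repeatedly draw a fresh noisy training set, refit a simple and a flexible model, and measure how much the prediction at one fixed point moves around. The signal, the noise level, the query point, and the polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 20)
x_query = 0.9       # the point at which we compare predictions
n_datasets = 200    # number of simulated training sets

def prediction_spread(degree):
    """Variance of the prediction at x_query across simulated training sets."""
    preds = []
    for _ in range(n_datasets):
        # Same underlying signal each time, but a fresh draw of noise.
        y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_query))
    return float(np.var(preds))

spread_line = prediction_spread(1)
spread_curve = prediction_spread(9)
print(f"degree 1 spread: {spread_line:.4f}, degree 9 spread: {spread_curve:.4f}")
```

The flexible model's prediction changes far more from one training set to the next, which is what high variance means.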
- High Variance: The model has been trained on a lot of noise and irrelevant data; it becomes very flexible and makes wrong predictions for new data points because it has tuned itself to the data points of the training set, i.e. high testing error (low testing accuracy).
Examples of high variance machine learning algorithms are Decision Trees, k-Nearest Neighbors, and Support Vector Machines.
- Low Variance: The estimate of the target function changes only slightly when the training dataset changes, i.e. low testing error (high testing accuracy).
Examples of low variance machine learning algorithms include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Real-world example: Consider a student named ‘Shivam’ who is studying for the IIT entrance exam. Shivam enrols in a coaching program to achieve his goal of being accepted into one of the IITs, and studies there for two years. During the coaching, Shivam takes multiple practice exams to assess his readiness; these are his ‘training data’. Finally, after two years of study, Shivam sits for the JEE exam, which serves as his ‘testing data’, since it assesses his final accuracy.
Assume that Shivam does exceptionally well in the coaching practice exams. This is regarded as low bias, since the training accuracy is high and the training error is low. What if Shivam does badly on these practice tests? Yeah…, you got it right; that is considered high bias.
Let’s take a look at variance now, which is related to testing data. The final JEE exam serves as testing data for Shivam. He will be either nervous or confident (depending on the training) when he eventually appears for the JEE exam after 2 years of intensive preparation. Suppose Shivam gets a high percentile on the test. This is a low-variance case, since the testing accuracy is high and the testing error is low. If Shivam fails miserably in the JEE exam, it is a high-variance case.
In simple terms, high bias = a simple model that under-fits the data
Conversely, high variance = a complex model that over-fits the data
When a model has not learned the patterns in the training data well and is unable to generalize to new data, it is known as underfitting. An underfit model performs poorly even on the training data and produces unreliable predictions. Underfitting can happen when, in an attempt to avoid overfitting, training is stopped at too early a stage, so that the model does not learn enough from the training data and fails to find the best fit to the dominant trend. It also happens when the model is too simple to capture the complex patterns in the data, as linear and logistic regression can be.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.
We can avoid underfitting:
- By increasing the training time of the model.
- By increasing the number of features.
Underfitting occurs due to high bias and low variance.
How to identify High Bias?
Because a high-bias model cannot identify the patterns in the data, it performs poorly on both the training and test sets. As there is a large difference between predicted and actual values, evaluation metrics like accuracy and F1 score are very low for such models.
How to Fix High Bias?
High bias comes from a model that is too simple, and it shows up as a high training error. To fix it, we can do the following:
- Add more input features
- Add more complexity by introducing polynomial features
- Decrease the regularization term
When a model performs very well on training data but poorly on test data (new data), it is known as overfitting. In this case, the model learns the details and noise in the training data to the extent that it hurts performance on test data. In simpler words, overfitting occurs when the model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, which reduces its efficiency and accuracy. The overfitted model has low bias and high variance.
The chance of overfitting increases the longer we train the model: the more we train, the more likely the model is to become overfitted. It is one of the main problems in supervised learning. For example, the goal of a regression model is to find the line of best fit, and when the model instead tries to pass through every point, it becomes overfitted.
There are some ways by which we can reduce the occurrence of overfitting in our model.
- Training with more data
- Removing features
- Stopping the training early
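The last item in the list above, stopping training early, can be sketched as plain gradient descent that watches a held-out validation set. This is a minimal illustration, not a definitive recipe: the function name `fit_with_early_stopping`, the learning rate, the `patience` value, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_with_early_stopping(X_train, y_train, X_val, y_val,
                            lr=0.1, max_steps=5000, patience=50):
    """Gradient descent on squared error; stop when validation MSE stalls."""
    weights = np.zeros(X_train.shape[1])
    best_weights, best_val, steps_since_best = weights.copy(), np.inf, 0
    for step in range(max_steps):
        grad = 2.0 * X_train.T @ (X_train @ weights - y_train) / len(y_train)
        weights = weights - lr * grad
        val_mse = float(np.mean((X_val @ weights - y_val) ** 2))
        if val_mse < best_val:
            # New best validation score: remember these weights.
            best_weights, best_val, steps_since_best = weights.copy(), val_mse, 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                break  # no improvement for `patience` steps in a row
    return best_weights, best_val, step + 1

rng = np.random.default_rng(3)

def make_split(n):
    """Synthetic data with degree-9 polynomial features (assumed setup)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return np.vander(x, 10, increasing=True), y

X_train, y_train = make_split(40)
X_val, y_val = make_split(40)
weights, best_val, steps_run = fit_with_early_stopping(X_train, y_train, X_val, y_val)
print(f"stopped after {steps_run} steps, best validation MSE {best_val:.4f}")
```

The function returns the weights from the best validation step rather than the last step, so any extra training done while waiting out the patience window is discarded.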
Overfitting can happen due to low bias and high variance.
How to identify High Variance?
A model with high variance performs well on the training set but poorly on the testing set. It does not generalize well to data it has not seen previously, so the test accuracy will be low while the training accuracy is high.
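The checks for high bias and high variance described in this post can be combined into one rough rule of thumb. The helper below is hypothetical, and its thresholds (`acceptable_error` and `max_gap`) are arbitrary values assumed for illustration; sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_error, test_error, acceptable_error=0.1, max_gap=0.05):
    """Rough rule-of-thumb label from training and test error rates."""
    if train_error > acceptable_error:
        # The model struggles even on data it has seen.
        return "high bias (underfitting)"
    if test_error - train_error > max_gap:
        # Good on training data, much worse on unseen data.
        return "high variance (overfitting)"
    return "reasonable fit"

print(diagnose_fit(0.30, 0.32))  # poor on both sets
print(diagnose_fit(0.02, 0.25))  # good on training, poor on test
print(diagnose_fit(0.04, 0.06))  # good on both
```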
How to fix High Variance?
High variance comes from a model that tries to fit most of the training data points and hence becomes more complex. To resolve a high-variance issue, we can work on the following steps:
- Get more training data
- Reduce the input features
- Increase the regularization term
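As a sketch of the first item in the list above, the simulation below trains the same deliberately flexible model on a small and on a large training set and compares the error on a fixed test set. Everything here is an assumption made for illustration: the sine signal, the noise level, the degree-9 polynomial, and the two training set sizes.

```python
import numpy as np

rng = np.random.default_rng(5)
DEGREE = 9    # deliberately flexible model (assumed for illustration)
NOISE = 0.3   # standard deviation of the simulated label noise

def true_signal(x):
    return np.sin(3.0 * x)

# A fixed test set drawn from the same range as the training data.
x_test = rng.uniform(-1.0, 1.0, size=500)
y_test = true_signal(x_test) + rng.normal(scale=NOISE, size=x_test.size)

def average_test_mse(n_train, n_repeats=20):
    """Mean test MSE of the flexible fit over several simulated training sets."""
    errors = []
    for _ in range(n_repeats):
        x_train = np.linspace(-1.0, 1.0, n_train)
        y_train = true_signal(x_train) + rng.normal(scale=NOISE, size=n_train)
        coeffs = np.polyfit(x_train, y_train, DEGREE)
        errors.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return float(np.mean(errors))

mse_small = average_test_mse(15)
mse_large = average_test_mse(300)
print(f"15 training points: {mse_small:.4f}, 300 training points: {mse_large:.4f}")
```

With more training data, the same flexible model has less room to chase noise, so its test error moves closer to the noise level.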
Balancing Bias and Variance in the Model
After the initial run of the model, you may notice that it doesn’t do as well on the validation set as you were hoping, because the model is affected by either high bias or high variance; in other words, either an under-fitting or an over-fitting problem. Ideally, we need to find a golden mean. The idea is to reduce the training error as well as the validation and test error to a point where the model is able to generalize well on unseen data.
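One way to look for this golden mean is to sweep a complexity setting and compare training and validation error at each value. The sketch below does this for the number of neighbours k in a small hand-written k-nearest-neighbour regressor; the helper names, the synthetic data, and the candidate values of k are assumptions made for the example.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """1-D k-nearest-neighbour regression: average the k closest targets."""
    distances = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(4)
x_train = rng.uniform(-1.0, 1.0, size=60)
y_train = np.sin(3.0 * x_train) + rng.normal(scale=0.3, size=60)
x_val = rng.uniform(-1.0, 1.0, size=60)
y_val = np.sin(3.0 * x_val) + rng.normal(scale=0.3, size=60)

results = {}
for k in (1, 3, 5, 10, 20, 60):
    train_err = mse(knn_predict(x_train, y_train, x_train, k), y_train)
    val_err = mse(knn_predict(x_train, y_train, x_val, k), y_val)
    results[k] = (train_err, val_err)
    print(f"k={k:>2} train MSE={train_err:.3f} validation MSE={val_err:.3f}")

# Pick the k with the lowest validation error.
best_k = min(results, key=lambda k: results[k][1])
print("best k by validation error:", best_k)
```

At k=1 the model memorizes the training set (zero training error, the high-variance end), while at k=60 it predicts the same average everywhere (the high-bias end). The validation error is what tells us where in between to settle.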
Below are some methods to solve the bias-variance dilemma.
- Choose appropriate algorithm
- Reduce dimensions
- Reduce error
- Use regularization techniques
- Use ensemble models, bagging, resampling, etc.
- Fit model parameters, e.g., find the best k for KNN, find the optimal C value for SVM.
- Tune impactful hyperparameters
- Use proper model performance metrics
- Use systematic cross-validation