So… Ridge Regression is a modified version of Linear Regression, and a classic example of regularization using an L2 penalty. To learn about Ridge Regression, you have to make sure you understand Linear Regression first. If you don’t, then click here. If you don’t know what Gradient Descent is, then click here.
It is an absolute must that you know both concepts before proceeding with this article. Take your time to learn these things, I’ll wait for you here.
Cool. Let’s proceed then.
Oh yeah, one more thing. I made this Kaggle Notebook, which has interactive code in it. If you want to see how you can code Ridge Regression in Python from scratch, click here. It’s pretty good, not gonna lie. And don’t worry, the content in it is not inferior to this article. But if you only want to learn the concepts of Ridge Regression and how it works, then no problem, just follow this article (I won’t know whether you skipped the notebook anyway).
Why Ridge Regression?
As I mentioned above, Ridge Regression is just a modified version of Linear Regression. It’s something new…but not something entirely new. But why did Ridge Regression even come into existence and why did we all just accept it?
Well…the answer is pretty simple.
In Linear Regression you need a lot of data to make accurate predictions. But if you only have a small subset of the original data, then your predictions would be pretty whack (inaccurate). Ridge Regression helps with this by letting us make reasonably accurate predictions even when we have very limited data. Let’s look at an example.
Suppose you have two lists x and y. x = [1, 2, 5, 6, 8, 9, 12, 14] and y = [3, 6, 8, 4, 9, 12, 9, 12].
If we plot a line of best fit for this data using Linear Regression with Gradient Descent (we discussed it in this article), it would look something like this,
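As a rough sketch of what that looks like in code (the learning rate, epoch count, and function names here are my own illustrative choices, not necessarily what the notebook uses):

```python
# Illustrative sketch: fitting a line to the data with plain Gradient Descent.
x = [1, 2, 5, 6, 8, 9, 12, 14]
y = [3, 6, 8, 4, 9, 12, 9, 12]

def fit_line(xs, ys, lr=0.005, epochs=10000):
    b0, b1 = 0.0, 0.0  # intercept and slope
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the MSE cost with respect to b0 and b1
        grad_b0 = (-2 / n) * sum(yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys))
        grad_b1 = (-2 / n) * sum((yi - (b0 + b1 * xi)) * xi for xi, yi in zip(xs, ys))
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

b0, b1 = fit_line(x, y)
print(b0, b1)  # for this data the fit lands near intercept ≈ 3.6, slope ≈ 0.59
```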
But suppose we didn’t have the whole data, but only a subset of it. Like the first two items from x and y. Because we have a lot less data here compared to the previous example, we can assume that the prediction from this model won’t be very accurate.
Let’s plot the line of best fit we get from the model trained on this subset.
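To make the subset fit concrete, here’s a hedged sketch using the closed-form least-squares formulas (purely for brevity; Gradient Descent would land in the same place) on just the first two points:

```python
# Ordinary least squares on only the first two data points.
x_sub, y_sub = [1, 2], [3, 6]
n = len(x_sub)
mean_x = sum(x_sub) / n
mean_y = sum(y_sub) / n

# Closed-form slope and intercept for simple linear regression
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x_sub, y_sub)) / \
        sum((xi - mean_x) ** 2 for xi in x_sub)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 3.0 0.0 — much steeper than the full-data fit
```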
As you can see from the two graphs above…this new line of best fit which was calculated using a small subset of the original data is not very accurate. It strays off from the original data by a lot and is nowhere close to the original line of best fit.
But if we plotted this new line of best fit using Ridge Regression, we’ll be able to prevent this. But how would we be able to prevent this and what’s the logic behind it? For this question, let me introduce three new terms. Bias, Variance, and Bias-Variance tradeoff.
Bias and Variance
Bias and variance are both just fancy terms for pretty simple concepts.
In this article we’ll use the informal working definitions: bias is the error our model makes on the training data, while variance is the error it makes on the testing data. (More formally, bias measures how far a model’s average prediction is from the true values, and variance measures how much its predictions change when it’s trained on different data — but the informal version is enough for our purposes.)
It is impossible for a Machine Learning model to have 0 bias and 0 variance, but we can surely minimize them to their least possible values.
If our model has both high bias and high variance, then it is underfitting the data. Meaning that our model is unable to capture the trend in the data and yields bad predictions on both the training and the testing data.

If our model has low bias but high variance, then it is overfitting the data. Meaning that our model got way too comfortable with the training data, but when introduced to new data, it is unable to adapt. A big reason for overfitting is that the model tries to predict even the outliers correctly.
The best type of model is one that has both low bias and low variance, that is a case in which our model is neither underfitting the data, nor overfitting it.
As I said earlier, having a model with both 0 bias and 0 variance is impossible. But we can get the best model by making sure the bias and variance are both minimized as far as possible. This is done by nudging a model’s bias up or down and seeing how that affects its variance (and vice versa).
Let’s take the example of overfitting.
In an overfitting model, we see that the model is very good at predicting the training data but very inaccurate at predicting the testing data. And the main reason for this is that the model got way too comfortable with the training data.
So to fix this all we need to do is make the model a little less accurate at making predictions on the training data. That by itself would allow the model to adapt to new testing data.
In the image above you can see that the green line fits the training data perfectly and is even accounting for the outliers. But because it’s way too specific and doesn’t follow a particular trend, it’s not very good at predicting new values. In contrast, the black line has a clear trend even though it misclassifies a few values. This trend itself makes it more adaptable to new data compared to the green line.
The green line has a very low bias but high variance. But in the case of the black line, even though it has a higher bias than the green line, this high bias allows it to be more adaptable to new data which decreases its variance.
In the above example, we were able to decrease the variance by increasing the bias. This is what is called a bias-variance tradeoff.
Ridge Regression does the exact same thing. Even though the line made using the training data doesn’t fit it as nicely as the line made using Linear Regression, this line would be better at adapting to newer data compared to the other one. We’ll see this in more detail in just some time.
So the only difference in Ridge Regression when compared to Linear Regression is the Cost Function. If you remember Gradient Descent, then you can probably recall how important of a role the Cost function played in the prediction.
If you remember, in Gradient Descent we use the Mean Squared Error (MSE) as the cost function: J(β₀, β₁) = (1/n) Σᵢ (yᵢ − (β₀ + β₁xᵢ))².
For the highest possible accuracy, we want to minimize the cost function, i.e. J(β₀, β₁) ≈ 0.
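As a quick sketch (the variable names are my own), here’s that cost as code, evaluated near the line of best fit and at an arbitrary bad guess:

```python
# Minimal sketch of the MSE cost J(b0, b1); b0 is the intercept, b1 the slope.
def mse_cost(b0, b1, xs, ys):
    n = len(xs)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys)) / n

x = [1, 2, 5, 6, 8, 9, 12, 14]
y = [3, 6, 8, 4, 9, 12, 9, 12]

# The cost near the line of best fit is much smaller than at a bad guess
print(mse_cost(3.6, 0.59, x, y))  # small
print(mse_cost(0.0, 0.0, x, y))   # large
```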
For Ridge Regression, we’ll change this formula a little: J(β₀, β₁) = (1/n) Σᵢ (yᵢ − (β₀ + β₁xᵢ))² + λβ₁².
This extra term, λβ₁², that has been added to the Cost Function is called the penalty (or penalization) term.

Here λ is called the penalization factor. If λ is set to a very large number, like 100000, then the slope of the line of best fit would be very close to 0. Not exactly zero, but very close to it.

The penalty term punishes large slope values by giving them a high Cost Function value. This is done because large slope values can be a sign of overfitting. For a large slope, β₁ is a large number, which means the whole term λβ₁² is a large number, which in turn inflates the Cost Function. In other words, the Cost Function won’t be minimized at large values of β₁.
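Putting that together, here’s a hedged sketch of Gradient Descent with the penalty term added (the λ values, learning rate, and epoch count are my own illustrative picks). Notice how training on just the two-point subset with λ = 1 lands close to the full-data line of best fit:

```python
# Sketch of Ridge Regression trained with Gradient Descent.
# The cost is MSE + lam * b1**2, so the gradient wrt b1 gains a 2*lam*b1 term.
def fit_ridge(xs, ys, lam=1.0, lr=0.005, epochs=20000):
    b0, b1 = 0.0, 0.0  # intercept and slope
    n = len(xs)
    for _ in range(epochs):
        grad_b0 = (-2 / n) * sum(yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys))
        grad_b1 = (-2 / n) * sum((yi - (b0 + b1 * xi)) * xi for xi, yi in zip(xs, ys)) \
                  + 2 * lam * b1  # derivative of the lam * b1**2 penalty
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

x_sub, y_sub = [1, 2], [3, 6]  # the two-point subset from earlier
print(fit_ridge(x_sub, y_sub, lam=1.0))   # slope shrinks from 3.0 to about 0.6
print(fit_ridge(x_sub, y_sub, lam=10.0))  # a bigger lambda pushes the slope toward 0
```

(With a huge λ like 100000 you’d also need a much smaller learning rate for plain Gradient Descent to stay stable, which is why I demo with modest values here.)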
Let’s use this new Cost Function for plotting a line of best fit on the subset of the data.
You can see that the new line we got using Ridge Regression is very different from the older one, even though both were trained on the same subset of the data. The new Cost Function seems to be doing something for sure. The two lines here are like your average pair of siblings. Even though both grew up in the same environment, the younger one did better than the older one.
Bad analogies and failed attempts at a good joke aside, let’s plot this line of best fit we got from Ridge Regression on the whole data and compare it to the line of best fit we got from Linear Regression on the whole data.
You can see that the line of best fit we got from applying Ridge Regression on the subset of the data and the line of best fit we got by applying Linear Regression on the whole data nearly overlap. That’s good news!
So…that’s Ridge Regression in a nutshell. It isn’t difficult if you already understand Gradient Descent, but if you still had some difficulties don’t worry. Everyone learns at a different rate. Go through the notebook once more, I am sure you’ll understand it!
- Kaggle Notebook – https://www.kaggle.com/code/slyofzero/ridge-regression-from-scratch
- YouTube video on Ridge Regression by StatQuest – https://youtu.be/Q81RR3yKn30