This is Part 1 of our article.
In regression analysis, the effect of each feature is estimated as a coefficient. If these estimates can be restricted, shrunk, or regularized towards zero, the impact of insignificant features is reduced, which protects the model from high variance and gives a more stable fit.
Regularization shrinks the coefficient estimates towards zero. It adds a penalty that grows with model complexity, which discourages the model from learning overly complex patterns and reduces the chance of over-fitting.
The goal of regularization is a model that generalizes to data it hasn’t seen before, i.e. we fit our machine learning model on the training set in a way that also keeps its errors low on a held-out test set.
Regularization Techniques: There are two main types of regularization techniques: L1 (Lasso) and L2 (Ridge).
In both types, a penalty term is added to the loss function, and its weight is set by a hyper-parameter written as the Greek letter lambda (λ). This penalty term is what mathematically shrinks the coefficients, so the model is less able to fit the noise in our data.
a. L1 Lasso Regularization:
The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of data models and feature selection.
A regression model that uses the L1 regularization technique is called Lasso Regression.
Lasso Regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.
It is often preferred over plain regression methods when more accurate predictions are needed. This model uses shrinkage, where estimates are pulled towards a central point such as the mean; in lasso, the coefficient estimates are pulled towards zero. It favors simple, sparse models that possess fewer parameters, because some coefficients can be shrunk exactly to zero. We get a better interpretation of the models due to the shrinkage process. The shrinkage process also enables the identification of the variables most strongly associated with the target.
Lasso = loss + (lambda * l1_penalty)
Here, lambda is the hyper-parameter that controls how heavily the penalty is weighted.
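The formula above can be sketched in a few lines of NumPy. This is only an illustration: the function name `lasso_objective`, the toy data, the choice of the residual sum of squares as the loss, and the omission of an intercept are all assumptions made for the example.

```python
import numpy as np

def lasso_objective(X, y, coef, lam):
    """Return loss + lam * l1_penalty for a linear model without intercept."""
    residuals = y - X @ coef
    loss = np.sum(residuals ** 2)        # residual sum of squares (assumed loss)
    l1_penalty = np.sum(np.abs(coef))    # sum of absolute coefficient values
    return loss + lam * l1_penalty

# Toy data (hypothetical): these coefficients fit y exactly, so the loss is 0.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
coef = np.array([1.0, 2.0])

print(lasso_objective(X, y, coef, lam=0.0))  # 0.0 (no penalty)
print(lasso_objective(X, y, coef, lam=0.5))  # 1.5 (0.5 * (|1| + |2|))
```

Even with a perfect fit, the objective is non-zero once lambda is positive, which is why the optimizer is pushed towards smaller coefficients.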
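The lasso regression estimate is defined as
$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
Here the tuning factor λ controls the strength of the penalty, that is:
- When λ = 0: We get the same coefficients as simple linear regression
- When λ = ∞: All coefficients are zero
- When 0 < λ < ∞: We get coefficients between 0 and those of simple linear regression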
So when λ is in between the two extremes, we are balancing the following two ideas:
1. Fitting a linear model of y on X
2. Shrinking the coefficients
2. Shrinking the coefficients
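The three regimes of λ can be checked numerically. The sketch below is one possible way to do it, assuming a residual-sum-of-squares loss with no intercept and solving the lasso problem by coordinate descent with soft-thresholding. The function names `soft_threshold` and `lasso_cd`, the synthetic data, and the λ values are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    coef = np.zeros(n_features)
    col_sq = np.sum(X ** 2, axis=0)  # squared norm of each column
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ partial_resid
            coef[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return coef

# Synthetic data (assumed): the third feature has no real effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=50)

coef_ols = np.linalg.lstsq(X, y, rcond=None)[0]
coef_zero_lam = lasso_cd(X, y, lam=0.0)   # matches ordinary least squares
coef_mid_lam = lasso_cd(X, y, lam=20.0)   # shrunk towards zero
coef_huge_lam = lasso_cd(X, y, lam=1e6)   # every coefficient is zero

print("OLS:       ", np.round(coef_ols, 3))
print("lam = 0:   ", np.round(coef_zero_lam, 3))
print("lam = 20:  ", np.round(coef_mid_lam, 3))
print("lam = 1e6: ", np.round(coef_huge_lam, 3))
```

In practice a library implementation would normally be used instead; the hand-written solver is here only to make the shrinking step visible.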
b. L2 Ridge Regression:
A regression model that uses the L2 regularization technique is called Ridge Regression. It adds the “squared magnitude” of the coefficients as a penalty term to the loss function.
Here, if lambda is zero, we get back ordinary least squares (OLS). However, if lambda is very large, the penalty carries too much weight and leads to under-fitting. How lambda is chosen therefore matters. Chosen well, this technique works very well to avoid over-fitting issues.
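The effect of lambda can be seen with a small experiment. The sketch below assumes a residual-sum-of-squares loss with no intercept, for which the ridge coefficients have the closed form (XᵀX + λI)⁻¹Xᵀy. The helper name `ridge_fit`, the synthetic data, and the lambda values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for ||y - X b||^2 + lam * ||b||^2."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Synthetic data (assumed) with known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=30)

for lam in [0.0, 1.0, 100.0, 1e6]:
    coef = ridge_fit(X, y, lam)
    train_mse = np.mean((y - X @ coef) ** 2)
    # Larger lam gives smaller coefficients and a worse fit on the training data.
    print(f"lam={lam:>9}: coef norm={np.linalg.norm(coef):.4f}, train MSE={train_mse:.4f}")
```

With lambda at zero the result coincides with OLS, and as lambda grows the coefficient norm falls while the training error rises, which is the under-fitting described above.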
Ridge regression is a technique specialized for analyzing multiple regression data that suffers from multicollinearity, i.e. predictors that are highly correlated with each other.
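One way to see why the L2 penalty helps with multicollinearity is to look at the matrix that has to be inverted when solving for the coefficients. The sketch below builds two almost identical predictors and compares the condition number of XᵀX with that of XᵀX + λI. The data and the value of `lam` are assumptions chosen purely for illustration.

```python
import numpy as np

# Two almost identical predictors: a simple case of multicollinearity.
x1 = np.linspace(0.0, 1.0, 20)
x2 = x1 + 1e-6 * np.sin(np.arange(20))
X = np.column_stack([x1, x2])

lam = 1.0  # illustrative penalty strength, not a recommended value
gram_ols = X.T @ X                       # matrix solved in ordinary least squares
gram_ridge = gram_ols + lam * np.eye(2)  # matrix solved in ridge regression

cond_ols = np.linalg.cond(gram_ols)
cond_ridge = np.linalg.cond(gram_ridge)
print(f"condition number without penalty: {cond_ols:.3e}")
print(f"condition number with penalty:    {cond_ridge:.3e}")
```

A very large condition number means tiny changes in the data can swing the coefficients wildly. Adding λ to the diagonal brings the number down, which is one reason ridge estimates tend to be more stable when predictors are correlated.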
The main idea of Ridge Regression is to fit a new line that doesn’t fit the training data quite as closely. In other words, we introduce a certain amount of bias into the new trend line, and in return its variance drops.
What we do in practice is add a penalty whose strength is set by a value that we call lambda.
The penalty function is: lambda * slope², summed over all slopes when there are several features.
Lambda weights the penalty term, and this form of penalty is called the Ridge or L2 penalty.
- λ is the tuning factor that controls the strength of the penalty term.
- If λ = 0, the objective becomes identical to that of simple linear regression, so we get the same coefficients as simple linear regression.
- If λ = ∞, the coefficients will be zero because of the infinite weightage on the square of the coefficients: any coefficient other than zero makes the objective infinite.
- If 0 < λ < ∞, the magnitude of λ decides the weightage given to the different parts of the objective.
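These three cases can be worked out by hand for the simplest setting. The derivation below assumes a single feature x, a single slope b, no intercept, and a squared-error loss; the symbols J and b are introduced here only for the example.

```latex
\begin{aligned}
J(b) &= \sum_{i=1}^{n} (y_i - b\,x_i)^2 + \lambda\, b^2 \\
\frac{dJ}{db} &= -2\sum_{i=1}^{n} x_i\,(y_i - b\,x_i) + 2\lambda\, b = 0 \\
\hat{b} &= \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \lambda}
\end{aligned}
```

With λ = 0 the expression reduces to the ordinary least squares slope. As λ grows towards ∞ the denominator grows without bound and the slope goes to zero. For any value in between, the slope lies between zero and the least squares slope.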
You can also refer to this article: https://neptune.ai/blog/fighting-overfitting-with-l1-or-l2-regularization