# 3. Embedded Methods

Embedded methods learn which features contribute most to the accuracy of a machine learning model as part of training the model itself. Many of them have built-in penalization functions that reduce overfitting.

They combine the benefits of the wrapper and filter methods: they account for interactions between features while keeping the computational cost reasonable.

The typical steps for embedded methods are to train a machine learning algorithm using all the features, then derive the importance of each feature according to the algorithm used. Afterward, unimportant features are removed based on criteria specific to the algorithm.
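The last two steps can be sketched in a few lines of Python. This is only an illustration: the feature names, the coefficient values, and the `select_by_importance` helper are hypothetical, and the coefficients are assumed to come from a model already trained on all (standardized) features.

```python
def select_by_importance(importances, threshold=1e-6):
    """Keep the features whose absolute importance exceeds the threshold."""
    return [name for name, value in importances.items() if abs(value) > threshold]


# Hypothetical coefficients from a penalized linear model trained on all features.
fitted_coefficients = {
    "age": 0.82,
    "income": -1.35,
    "zip_code": 0.0,
    "shoe_size": 0.0,
}

kept = select_by_importance(fitted_coefficients)
print(kept)  # ['age', 'income']
```

The threshold is the "criteria specific to the algorithm" part: for a lasso-style model it can sit just above zero, while for tree-based importances a larger cutoff is usually needed.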

Embedded methods are implemented by algorithms that have feature selection built in.

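Some of the most popular examples of these methods are **LASSO** and **RIDGE regression**, which have built-in penalization functions to reduce overfitting. Note that ridge shrinks coefficients without setting them exactly to zero, so on its own it ranks features rather than removing them.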

**A. LASSO Regression:**

**LASSO** stands for **L**east **A**bsolute **S**hrinkage and **S**election **O**perator.

**WHY Lasso?**

When we have too little data for the number of features, the model tends to overfit. Overfitting reduces the accuracy of our machine learning model on new data. It means that the model fits the training data too closely, noise included, and fails to generalize.

**Did you ever try on clothes tailored exactly to someone else's measurements?** A suit cut to follow one person's every contour rarely fits anyone else, and that is the overfitting problem. The same problem occurs in a dataset if you keep increasing the number of features to decrease the cost function.

Overfitting happens in linear models when there are many features and little data. If we cannot get rid of this problem, it hurts model performance. Here, Lasso regression comes into the picture. It reduces overfitting by adding a penalty on the size of the coefficients.

*L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients.*

**What is LASSO?**

- Lasso regression performs L1 regularization.

- Lasso regression is almost identical to ridge regression; the only difference is that the penalty uses the absolute values of the weights rather than their squares.

- Lasso regression is like linear regression, but it uses a technique called **“shrinkage”**, where the regression coefficients are shrunk towards **zero** to avoid overfitting and help the model work better on different datasets.

- This type of regression is used when the dataset shows high multicollinearity or when you want to automate variable elimination and **feature selection**.
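The difference between the two penalties, and why only lasso sets coefficients exactly to zero, can be seen in a simplified one-coefficient setting. The sketch below is an illustration under stated assumptions, not a full regression: `b` stands for the unpenalized estimate of a single coefficient, and the function names are made up for this example.

```python
def l1_penalty(coefs):
    """Lasso penalty: sum of absolute values of the coefficients."""
    return sum(abs(c) for c in coefs)


def l2_penalty(coefs):
    """Ridge penalty: sum of squared coefficients."""
    return sum(c * c for c in coefs)


def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta), known as soft-thresholding."""
    magnitude = max(abs(b) - lam, 0.0)
    return magnitude if b >= 0 else -magnitude


def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2."""
    return b / (1.0 + lam)


for b in (2.0, 0.3, -0.3):
    print(b, lasso_shrink(b, 0.5), ridge_shrink(b, 0.5))
```

With `lam = 0.5`, a small estimate such as 0.3 is set exactly to zero by the lasso rule, while the ridge rule only scales it down to about 0.2. That is the mechanism behind lasso's automatic variable elimination.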

**The Statistics of Lasso Regression**

d1, d2, d3, etc., represent the distances between the actual data points and the model line in the above graph.

**Least-squares is the sum of the squares of the distances between the points and the plotted curve.**

In linear regression, the best model is the one that minimizes this sum of squares.
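As a small illustration of least-squares, the sketch below fits a straight line to four made-up points and compares its sum of squares with that of another candidate line. The data values and function names are assumptions chosen for this example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_of_squares(xs, ys, intercept, slope):
    """Sum of the squared distances d1, d2, ... between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

intercept, slope = fit_line(xs, ys)
best = sum_of_squares(xs, ys, intercept, slope)
other = sum_of_squares(xs, ys, 0.0, 2.0)  # another candidate line, y = 2x
print(best <= other)  # True
```

No other straight line can give a smaller sum of squares on these points than the fitted one, which is what "best" means in this setting.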

While performing lasso regression, we add a penalizing factor to the least-squares. That is, the model is chosen so that the loss function below is as small as possible. During fitting, the lasso algorithm tries to minimize the difference between the predicted and actual values of the observations, plus the penalty.

**D = least-squares + lambda * summation (absolute values of the coefficients)**
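Written out in symbols, the same loss looks as follows. The notation is an assumption, since the formula above is given only in words: $y_i$ are the observed values, $\hat{y}_i$ the model's predictions, $\beta_j$ the coefficients of the $p$ features, and $\lambda \ge 0$ the penalty strength.

$$
D = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
$$

Some libraries scale the first term, for example by $1/(2n)$, which changes the numerical meaning of lambda but not the idea.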

The lasso penalty includes all the estimated coefficients (the intercept is usually left unpenalized). Lambda can be any value from zero to infinity, and it decides how aggressively regularization is applied; with lambda equal to zero, lasso reduces to ordinary least squares. Lambda is usually chosen using cross-validation. Because lasso penalizes the sum of the absolute values of the coefficients, the coefficients shrink as lambda increases and, one after another, **become zero**.
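The effect of lambda can be sketched with a small from-scratch solver. The code below is a minimal coordinate-descent illustration, not a production implementation: it assumes centered data with no intercept, it minimizes the loss exactly as written above (least-squares plus lambda times the sum of absolute coefficients), and the toy dataset and function names are made up for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize sum((y - X @ beta)**2) + lam * sum(|beta|), without an intercept."""
    n_samples, n_features = len(X), len(X[0])
    beta = [0.0] * n_features
    for _ in range(n_iters):
        for j in range(n_features):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n_samples):
                partial = sum(X[i][k] * beta[k] for k in range(n_features) if k != j)
                rho += X[i][j] * (y[i] - partial)
            z = sum(X[i][j] ** 2 for i in range(n_samples))
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Hypothetical centered data: y = 3 * x1 + 0.5 * x2, so x2 is the weak feature.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.5, -2.5, 2.5, -3.5]

path = {lam: lasso_coordinate_descent(X, y, lam) for lam in (0.0, 2.0, 4.0, 24.0)}
for lam, beta in path.items():
    print(lam, beta)
```

On this toy data the weak feature's coefficient goes from 0.5 at lambda 0 to exactly zero at lambda 4, while the strong feature's coefficient only shrinks from 3 to 2.5; at lambda 24 both coefficients are zero.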

**This way, lasso regression eliminates insignificant variables from our model.**