4. Linear Regression: Formulas, Explanation, and a Use-case
Linear Regression is a supervised Machine Learning model that finds the best-fit straight line between the independent and dependent variables, i.e. it models the linear relationship between the dependent and independent variables.
It is used for predicting continuous values such as house prices, income, population, etc.
The linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable’s value is called the independent variable.
The formula for simple linear regression analysis is: y = β0 + β1x, where β0 is the intercept and β1 is the slope of the line.
Why is it called “linear” regression?
Linear suggests that the relationship between dependent and independent variables can be expressed in a straight line.
- Simple Linear Regression uses a single feature (independent variable) to predict a target (dependent variable) by fitting the best linear relationship.
- Multiple Linear Regression uses more than one feature to predict a target variable by fitting the best linear relationship.
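As a minimal sketch of the simple case, the best-fit slope and intercept for a single feature can be computed with NumPy; the data below is made up purely for illustration.

```python
import numpy as np

# Hypothetical data: a single feature x and target y that roughly follow y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Simple linear regression: least-squares fit of a degree-1 polynomial
slope, intercept = np.polyfit(x, y, deg=1)
```

With this toy data the fitted slope comes out close to 2, matching the trend the points were drawn around.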
Use-case: How price of commodity changes over time
Let’s understand Linear regression with the help of a use-case:
One interesting application of a linear regression model could be to understand how the price of a commodity changes over time. Let’s take a look at orange prices in the US for every year since 1980. The prices are shown in USD for a pound of navel oranges.
There are two columns: Year and Price.
A quick glance tells us that prices have gone up each year, but what is harder to tell is the precise relationship between Year and Price over the last 40 years. Let’s plot these points on a 2D graph with Year on the X-axis and Price on the Y-axis, since we want to predict the Y variable by looking at just the X variable.
The assumption behind linear regression is that the effect of the X variable (Year in this case) is constant over time. That is, the price increase between 1990 and 1991 is the same as the price increase between 2005 and 2006. This is what makes Linear Regression “linear”.
To be clear, in the real world, data is messy and never quite fits a model so cleanly. However, Linear Regression works surprisingly well in many real cases.
Our task is now to find a line that passes as close as possible to all the points in the data that we observe.
There are many different lines that we can fit to this data. The two parameters that we control are the intercept, the point where the line crosses the Y-axis, and the slope, how much the line rises as you move along the X-axis. Once we have a regression line with defined parameters, slope and intercept, we can easily predict y_hat (the target).
How do we choose the right line to model the data?
To find the line of best fit for our data we can calculate the slope (m, the β1 coefficient) and intercept (b, the β0 coefficient). This gives us the equation of a line that minimizes the distance between our predicted values and our observed values. The collection of resulting distances is referred to as the residuals (a.k.a. errors) and can be used to assess the goodness of fit of our regression.
If our points sat along a perfect line then you could just connect any two of them and you’d get the right line, but because data tends to be “noisy” we’ll have to do something else. The best line is found by a process called “ordinary least squares” or OLS.
This method says that you want to pick the line that minimizes the distance between every point and the line. The distance between a point and the line is called the “error.” So, we want to reduce the total error of our line.
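The closed-form OLS estimates can be sketched directly with NumPy arithmetic: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope·x̄. The yearly prices below are made up for illustration (they lie exactly on a line so the result is easy to check).

```python
import numpy as np

# Hypothetical yearly prices (USD per pound), constructed to rise 0.02/year
years = np.array([1980.0, 1990.0, 2000.0, 2010.0])
prices = np.array([0.30, 0.50, 0.70, 0.90])

x_mean, y_mean = years.mean(), prices.mean()

# OLS closed form: slope = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2)
slope = np.sum((years - x_mean) * (prices - y_mean)) / np.sum((years - x_mean) ** 2)
intercept = y_mean - slope * x_mean
```

Because the toy data is perfectly linear, the recovered slope is exactly the 0.02 USD/year it was built with; real data would leave nonzero residuals.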
Minimizing the Error
The residual for each estimated value is calculated by subtracting it from the observed value. In other words, this tells us how far off each estimated value is from the actual observed value. This is represented by the following formula: e = y − y_hat, where y is the observed value and y_hat is the predicted value.
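A minimal sketch of computing residuals, assuming hypothetical observed and predicted values:

```python
import numpy as np

# Hypothetical observed prices and the model's predictions for the same years
observed = np.array([0.32, 0.48, 0.71, 0.93])
predicted = np.array([0.30, 0.50, 0.70, 0.90])

# Residual for each point: observed minus predicted (e = y - y_hat)
residuals = observed - predicted
```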
Now that we have a line that has the smallest possible error, shown in the figure below, we can predict the prices of oranges in 2020.
All we have to do is find 2020 on the X-axis, then draw a straight line from there to our linear regression line, then draw a straight line from there to the Y-axis, as shown on the right-hand side.
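Reading the prediction off the chart amounts to evaluating the line's equation at x = 2020; the coefficients below are assumed example values, not fitted from real price data.

```python
# Assumed example coefficients for a line price = slope * year + intercept
slope, intercept = 0.02, -39.3

# "Drawing a line up from 2020" is just evaluating the equation at x = 2020
price_2020 = slope * 2020 + intercept  # about $1.10 with these made-up numbers
```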
Now for something cooler! We can use more than just one variable to predict the prices.
Let’s try using two variables: the year and the rainfall in the previous year. Now that we’re in 3 dimensions, we’ll put the year on the X-axis, the previous year’s rainfall on the Z-axis, and price on the Y-axis. Now instead of a line, we’ll fit a plane to the data. The logic is still the same. We need a plane that has minimum error between itself and the data points that it needs to model.
To get the prediction all we need to do is find the year 2020 on the X-axis, the 2019 rainfall on the Z-axis, and find the Y reading! That’s it!!
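A sketch of fitting such a plane with NumPy's least-squares solver; the year, rainfall, and price figures below are made up (generated from the exact plane price = 0.5 + 0.05·(year − 2016) + 0.02·rainfall, so the recovered coefficients are easy to check).

```python
import numpy as np

# Hypothetical data: year, previous year's rainfall (inches), and price (USD)
year = np.array([2016.0, 2017.0, 2018.0, 2019.0])
rainfall = np.array([12.0, 9.0, 15.0, 11.0])
price = np.array([0.74, 0.73, 0.90, 0.87])  # lies exactly on the assumed plane

# Design matrix: intercept column, centered year, rainfall. Fitting a plane
# instead of a line, but the least-squares logic is the same.
X = np.column_stack([np.ones_like(year), year - 2016, rainfall])
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b_year, b_rain = coeffs

# Prediction for 2020, assuming 13 inches of rainfall in 2019
price_2020 = b0 + b_year * (2020 - 2016) + b_rain * 13.0
```

Centering the year (year − 2016) keeps the design matrix well-conditioned; the same trick is common whenever one feature is on a much larger scale than the others.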
Evaluation metrics for a linear regression model
Evaluation metrics are a measure of how well a model performs and how well it approximates the relationship.
They include: MSE, MAE, RMSE, R-squared, and Adjusted R-squared.
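These metrics can be sketched directly from their definitions; the observed and predicted values below are hypothetical.

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error
mae = np.mean(np.abs(errors))                 # Mean Absolute Error

ss_res = np.sum(errors ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                      # R-squared

n, p = len(y_true), 1                         # n samples, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1) # Adjusted R-squared
```

Adjusted R-squared penalizes extra predictors, which matters once we move from a single feature to multiple linear regression.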
Cons: Linear Regression can be very sensitive to outliers. Suppose you have some faulty data where there’s one point that’s way off, then you’re going to get a line that’s tilted towards that point and not really capturing the real model. So, it’s important to make sure that the data is clean and that outliers are removed before running linear regression.
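A quick sketch of this sensitivity, using made-up data where one point is deliberately corrupted:

```python
import numpy as np

# Clean data exactly on the line y = 2x, plus a copy with one faulty reading
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2 * x
y_outlier = y_clean.copy()
y_outlier[-1] = 30.0  # a single bad data point

slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)
# The single outlier tilts the whole fitted line sharply upward
```

In this toy example one bad point triples the fitted slope, which is why outlier inspection should precede the fit.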
Further reading on evaluation metrics of linear regression: https://www.aionlinecourse.com/tutorial/machine-learning/evaluating-regression-models-performance