Gradient Descent (now with a little bit of scary maths)

Buckle up Buckaroo because Gradient Descent is gonna be a long one (and a tricky one too).
This article will be a lot more “mathy” than most, as it tries to cover the concepts behind a Machine Learning algorithm called Linear Regression.
If you don’t know what Linear Regression is, go through this article once. It will help you understand the basics behind Linear Regression without actually discussing any complex mathematics.
I have already made a Google Colab Notebook covering this topic, so if you want to follow along, click here. The notebook teaches you how to recreate this algorithm from scratch, so it will be a very good learning experience if you are new to the field of Machine Learning.
Flavors of Linear Regression
Simple Linear Regression
Simple Linear Regression is a type of Linear Regression where we only use one variable to predict new outputs.
The equation of a straight line is –
y = mx + b, or equivalently y = β0 + β1x
Here, m (or β1) is the slope of the line for that variable, and b (or β0) is the intercept.
Multiple Linear Regression
If there were more than two terms in the equation above, we would be dealing with Multiple Linear Regression, and the equation of the line would look something like this:

y = β0 + β1x1 + β2x2 + … + βnxn
Multiple Linear Regression is the type of Linear Regression where we use multiple variables to predict new outputs.
In the above case, x1, x2,…, xn are the multiple independent variables (feature variables) that affect y, the dependent variable (target variable).
β1, β2, …, βn are the coefficients (the slopes for the corresponding feature variables), while β0 is still the intercept.
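To make this concrete, here is a minimal sketch (my own example values, not taken from the article or the notebook) of how a Multiple Linear Regression prediction can be computed with NumPy:

```python
import numpy as np

# One sample with three made-up feature values x1, x2, x3
x = np.array([2.0, 5.0, 1.5])

# Made-up coefficients: beta0 is the intercept, betas holds beta1..beta3
beta0 = 0.5
betas = np.array([1.2, -0.7, 3.0])

# y = beta0 + beta1*x1 + beta2*x2 + ... + betan*xn
y_pred = beta0 + np.dot(betas, x)
print(y_pred)  # 0.5 + 2.4 - 3.5 + 4.5 ≈ 3.9
```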
Using the formula for Simple Linear Regression, we can plot a straight line that tries to capture the trend between two variables. All we need are the values of β0 and β1.
x = [1,2,3,4,5]
y = [3,5,7,9,11]
Let’s take β0 as 0.55 and β1 as 3. Just two random numbers.

As you can see, the line we plotted is not very accurate and strays from the actual data by quite a lot. Basically, it has a lot of error.
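If you want to reproduce this line yourself, here is a quick sketch (my own code, not necessarily what the notebook does) that computes the predicted y values the line passes through:

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

beta0 = 0.55  # intercept
beta1 = 3     # slope

# Predicted y for every x, using y_hat = beta0 + beta1 * x
y_pred = [beta0 + beta1 * xi for xi in x]
print(y_pred)  # roughly [3.55, 6.55, 9.55, 12.55, 15.55]
```

The article’s graphs were drawn with plotly, but any plotting library will do for comparing y against y_pred.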
But how much exactly?
SSE, MSE, RMSE, and MAE
SSE, MSE, RMSE, and MAE are just different ways of calculating the errors in our predictions.
SSE
SSE stands for Sum of Squared Errors. It is simply the sum of the squares of the differences between the predicted and the actual values.
SSE = Σ (y_i − ŷ_i)², where the sum runs from i = 1 to n
y_i = the actual y value for that value of x.
ŷ_i (y_i_cap) = the value predicted by the line we plotted for that value of x.
n = the total number of y values.
Here we square y_i − ŷ_i because for some values of x this difference is positive, while for others it is negative. Since we want the total error, we need to sum these numbers up, but adding positive and negative numbers together would cancel out some of the error. Squaring y_i − ŷ_i avoids this, as it only gives us positive numbers.
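To see why squaring matters, here is a tiny made-up example (the numbers are mine, not from the article) where the raw differences cancel out but the squared differences do not:

```python
y_actual = [2, 4]
y_predicted = [3, 3]  # one prediction too high, one too low

# Signed differences cancel out and hide the error
raw_diffs = [ya - yp for ya, yp in zip(y_actual, y_predicted)]
print(sum(raw_diffs))  # 0, even though both predictions are wrong

# Squared differences stay positive, so the error stays visible
squared_diffs = [(ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted)]
print(sum(squared_diffs))  # 2
```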
MSE
MSE, or Mean Squared Error, is the average of the squares of the differences between the predicted and the actual values.
If you didn’t understand the sentence above, I hope this formula makes everything clearer.
MSE = (1/n) Σ (y_i − ŷ_i)²
RMSE
Because MSE squares the errors, the value it reports is not on the same scale as the original data. To fix this we can take the square root of the MSE. This is called RMSE, or Root Mean Squared Error.
RMSE = √MSE = √[(1/n) Σ (y_i − ŷ_i)²]
MAE
We can also take the absolute value of y_i − ŷ_i (instead of squaring it) to avoid getting negative error values. Averaging these absolute differences gives us MAE, or Mean Absolute Error.
MAE = (1/n) Σ |y_i − ŷ_i|
For the line we plotted, the MSE, RMSE, and MAE are 8.5, 2.92, and 2.55 respectively.
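As a quick sanity check, here is a small sketch (my own helper code, not the notebook’s) that recomputes those numbers for the line above:

```python
import math

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
y_pred = [0.55 + 3 * xi for xi in x]

# Squared and absolute differences between actual and predicted values
sq_errors = [(yi - ypi) ** 2 for yi, ypi in zip(y, y_pred)]
abs_errors = [abs(yi - ypi) for yi, ypi in zip(y, y_pred)]

sse = sum(sq_errors)            # Sum of Squared Errors
mse = sse / len(y)              # Mean Squared Error
rmse = math.sqrt(mse)           # Root Mean Squared Error
mae = sum(abs_errors) / len(y)  # Mean Absolute Error

print(round(mse, 2), round(rmse, 2), round(mae, 2))  # 8.5 2.92 2.55
```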
We need to minimize these errors to get much more accurate predictions. The ideal value would be 0, but that’s extremely rare in practice.
Cost Function
A cost function is a formula that is used to calculate the cost (errors/loss) for certain values of the original function.
In simple terms, a cost function just measures the errors in our prediction.
Sometimes the Cost Function is also called the Loss Function.
In our case, the Cost Function is the MSE itself. You could use RMSE or MAE as the Cost Function too, but I am going with MSE because it amplifies the errors a bit by squaring them.
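In code, this Cost Function is just the MSE written as a function of β0 and β1. A minimal sketch (the function name and signature are my own choices) could look like this:

```python
def cost(beta0, beta1, x, y):
    """MSE cost J(beta0, beta1) for the simple linear model y_hat = beta0 + beta1 * x."""
    n = len(y)
    squared_errors = [(yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)]
    return sum(squared_errors) / n

print(cost(0.55, 3, [1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # ≈ 8.5025, the 8.5 we saw above
```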
Gradient Descent
Let’s write each formula as a mathematical function:
1. Function for predicted values.
ŷ_i = β0 + β1x_i
2. Function for errors.
e_i = y_i − ŷ_i = y_i − (β0 + β1x_i)
3. Function for Cost Function.
J(β0, β1) = (1/n) Σ (y_i − ŷ_i)² = (1/n) Σ (y_i − (β0 + β1x_i))²
For the highest accuracy possible we want J(β0, β1) ≈ 0.
Phew, that was a lot, but we are halfway there now!
Now let’s plot the errors we get for different values of β0 and β1.


From the graphs above we can see that the Slope vs MSE curve and the Intercept vs MSE curve are both parabolas, which for certain values of the slope and the intercept come very close to 0.
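If you want to recreate something like the Slope vs MSE curve yourself, here is a rough sketch using plotly (which is what the article’s graphs were made with); the value ranges and variable names are my own guesses, not the notebook’s exact code:

```python
import numpy as np
import plotly.graph_objects as go

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

def mse(beta0, beta1):
    return np.mean((y - (beta0 + beta1 * x)) ** 2)

# Vary the slope while keeping the intercept fixed at 0.55
slopes = np.linspace(-2, 6, 200)
errors = [mse(0.55, b1) for b1 in slopes]

fig = go.Figure(go.Scatter(x=slopes, y=errors, mode="lines"))
fig.update_layout(xaxis_title="Slope (β1)", yaxis_title="MSE")
fig.show()
```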
Let’s plot a 3D graph for an even clearer picture.

(That’s a cool-looking graph ngl!)
Here we can clearly see that for certain values of β0 and β1, the value of J(β0, β1) is at its lowest.
Those are the values we want for our linear equation, as they would give us the highest accuracy possible.
Convergence Theorem

As you can see in the image above, we start from a particular coefficient value and take small steps towards the minimum error value. The size of each step is controlled by the learning rate.
Because we are taking small steps in order to converge to the minimum, this whole process is referred to as the convergence theorem.
The formula is something like this –
β0 = β0 − L·(∂J/∂β0)
β1 = β1 − L·(∂J/∂β1)
Here,
- β0 is the intercept of the linear equation.
- β1 is the slope of the linear equation.
- L is the learning rate.
- (∂J/∂β0) is the slope (partial derivative) of the function J(β0,β1) with respect to β0.
- (∂J/∂β1) is the slope (partial derivative) of the function J(β0,β1) with respect to β1.
With each iteration, we get closer to the ideal values of β0 and β1.
If you feel a bit confused about all of this, click here for a better understanding. In the video, he uses a different Cost Function, but don’t worry about that. The Cost Function can be anything and varies from method to method; you just have to make sure that it accurately measures the errors.
Okay cool. Now we are one step closer to implementing Gradient Descent in code. All we need to do is figure out what the values of (∂J/∂β0) and (∂J/∂β1) are, and we are good to go.

First let’s find ∂J/∂β0,
∂J/∂β0 = (1/n) Σ 2·(y_i − (β0 + β1x_i))·(−1) = (−2/n) Σ (y_i − ŷ_i)
Next, let’s find ∂J/∂β1,
∂J/∂β1 = (1/n) Σ 2·(y_i − (β0 + β1x_i))·(−x_i) = (−2/n) Σ x_i·(y_i − ŷ_i)
Suppose there was a β2 in the equation (which can happen in Multiple Linear Regression); then (∂J/∂β2) would be,
∂J/∂β2 = (−2/n) Σ x2_i·(y_i − ŷ_i), where x2_i is the value of the feature x2 for the i-th data point.
For any βn in the equation, the ∂J/∂βn would be,
∂J/∂βn = (−2/n) Σ xn_i·(y_i − ŷ_i)
Now that we have the values of ∂J/∂β0 and ∂J/∂β1, we can easily feed them into the update formula from the Convergence Theorem section.
With each iteration, we get one step closer to the minimum value of the Cost Function.
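Putting it all together, here is a minimal from-scratch sketch of Gradient Descent for Simple Linear Regression on the data from earlier. The learning rate, iteration count, and variable names are my own choices, not necessarily what the notebook uses:

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

beta0, beta1 = 0.0, 0.0   # start from arbitrary coefficient values
L = 0.01                  # learning rate
n = len(y)

for _ in range(10000):
    y_pred = [beta0 + beta1 * xi for xi in x]

    # dJ/dbeta0 = (-2/n) * sum(y_i - y_hat_i)
    d_beta0 = (-2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))
    # dJ/dbeta1 = (-2/n) * sum(x_i * (y_i - y_hat_i))
    d_beta1 = (-2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))

    # Update rule: step against the gradient, scaled by the learning rate
    beta0 -= L * d_beta0
    beta1 -= L * d_beta1

print(beta0, beta1)  # should end up close to 1 and 2, since y = 1 + 2x fits this data exactly
```

If you print the cost every few hundred iterations, you can watch it shrink towards 0, exactly as described in the Convergence Theorem section.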
To see how all this can be implemented in code, click here.
If you feel like you understood Gradient Descent, try deriving the formulas for Multiple Linear Regression. It won’t be much different from what you did in this article, and it’s a fun exercise to practice the concepts you just learned.
Sources –
- Image 13 – https://www.google.com/url?sa=i&url=https%3A%2F%2Ftowardsdatascience.com%2Fquick-guide-to-gradient-descent-and-its-variants-97a7afb33add&psig=AOvVaw0_5VuNnPMHiSbbNWCokuMW&ust=1649479616298000&source=images&cd=vfe&ved=0CAoQjRxqFwoTCNizwMvUg_cCFQAAAAAdAAAAABAD
- All mathematical equations were written using this website.
- The graphs were plotted in Python using plotly.