
Gradient Descent (now with a little bit of scary maths)


Buckle up Buckaroo because Gradient Descent is gonna be a long one (and a tricky one too).

This article is a lot more “mathy” than most, as it covers the concepts behind a Machine Learning algorithm called Linear Regression.

If you don’t know what Linear Regression is, go through this article first. It will help you understand the basics of Linear Regression without getting into any complex mathematics.

I have already made a Google Colab Notebook covering this topic, so if you want to follow along, click here. The notebook teaches you how to recreate this algorithm from scratch, so it’ll be a very good learning experience if you are new to the field of Machine Learning.

Flavors of Linear Regression

Simple Linear Regression

Simple Linear Regression is a type of Linear Regression where we only use one variable to predict new outputs.

The equation of a straight line is –

Gradient Descent - Simple Linear Regression
Gradient Descent – Simple Linear Regression (Image 1)

Here m or β1 is the slope of the line (for that variable) and b or β0 is the intercept.

Multiple Linear Regression

If there were more than two terms in the equation above, we would be dealing with Multiple Linear Regression, and the equation for the line would look something like this:

Gradient Descent - Multiple Linear Regression
Gradient Descent – Multiple Linear Regression (Image 2)

Multiple Linear Regression is the type of Linear Regression where we use multiple variables to predict new outputs.

In the above case, x1, x2,…, xn are the multiple independent variables (feature variables) that affect y, the dependent variable (target variable).

β1, β2, …, βn are the coefficients (slopes) for the corresponding feature variables, while β0 is still the intercept.
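As a quick sketch of what that equation computes, a multiple-regression prediction is just the intercept plus a weighted sum of the features (the numbers below are toy values of my own, only to show the shape of the computation):

```python
# Prediction for Multiple Linear Regression:
#   y = b0 + b1*x1 + b2*x2 + ... + bn*xn
def predict(intercept, coeffs, features):
    # Pair each coefficient with its feature and sum the products
    return intercept + sum(b * xi for b, xi in zip(coeffs, features))

print(predict(1.0, [2.0, 0.5], [3.0, 4.0]))  # 1 + 2*3 + 0.5*4 = 9.0
```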

Using the formula for Simple Linear Regression, we can plot a straight line that tries to capture the trend between two variables. All we need are the values of β0 and β1.

x = [1,2,3,4,5]

y = [3,5,7,9,11]

Let’s take β0 as 0.55 and β1 as 3. Just two random numbers.
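Here is a minimal sketch of how those two numbers produce a line of predictions (the plotting itself is left out):

```python
# Predict y for each x using the line y = b0 + b1 * x,
# with b0 = 0.55 and b1 = 3 -- the two "random" numbers from above.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = 0.55, 3

y_pred = [b0 + b1 * xi for xi in x]
print([round(p, 2) for p in y_pred])  # [3.55, 6.55, 9.55, 12.55, 15.55]
```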

Gradient Descent - Random line of best fit
Gradient Descent – Random line of best fit (Click here to view this chart on Chart Studio) (Image 3)

As you can see, the line we plotted is not very accurate; it strays quite far from the actual data. In other words, it has a lot of error.

But how much error, exactly?


SSE, MSE, RMSE, and MAE are just different ways of calculating the errors in our predictions.


SSE stands for Sum of Squared Errors. It is simply the sum of the squares of the differences between the predicted and the actual values.

Gradient Descent - SSE
Gradient Descent – SSE (Image 4)

y_i = Actual y value for that value of x.
y_i_cap = Value predicted by the line we plotted for that value of x.
n = Number of total y values.

We square y_i − y_i_cap because the difference is positive for some points and negative for others. Since we want the total error, we need to sum these values, but adding positive and negative numbers would cancel out some of the error. Squaring y_i − y_i_cap avoids this, as it always gives us a positive number.
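Continuing with the line from earlier, the SSE can be computed directly from its definition (a small sketch; the variable names are my own):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # actual values (y_i)
b0, b1 = 0.55, 3

y_pred = [b0 + b1 * xi for xi in x]   # predicted values (y_i_cap)

# Sum of Squared Errors: sum of (y_i - y_i_cap)^2 over all points
sse = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))
print(round(sse, 4))  # 42.5125
```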


MSE is the average of the square of the difference between the predicted and the actual value.

If you didn’t understand the sentence above, I hope this formula makes everything clearer.

Gradient Descent - MSE
Gradient Descent – MSE (Image 5)


Because MSE involves squaring the errors, it is not in the same units as the original data, so MSE alone does not give us error values we can read directly. To fix this, we can take the square root of the MSE. This is called RMSE, or Root Mean Squared Error.

Gradient Descent - RMSE
Gradient Descent – RMSE (Image 6)


We can also take the absolute value of y_i − y_i_cap (by applying mod) to avoid getting negative error values. This type of error is called MAE, or Mean Absolute Error.

Gradient Descent - MAE
Gradient Descent – MAE (Image 7)

For the line we plotted, MSE, RMSE, and MAE are 8.5, 2.92, and 2.55 respectively.
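Those three numbers can be verified in a few lines (a quick sketch using the same toy data and line from earlier):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = 0.55, 3
n = len(x)

y_pred = [b0 + b1 * xi for xi in x]
errors = [yi - ypi for yi, ypi in zip(y, y_pred)]

mse = sum(e ** 2 for e in errors) / n    # Mean Squared Error
rmse = mse ** 0.5                        # Root Mean Squared Error
mae = sum(abs(e) for e in errors) / n    # Mean Absolute Error

print(round(mse, 2), round(rmse, 2), round(mae, 2))  # 8.5 2.92 2.55
```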

We need to minimize these errors to get much more accurate predictions. The ideal value would be 0, but that’s extremely rare in practice.

Cost Function

A cost function is a formula that is used to calculate the cost (errors/loss) for certain values of the original function.

In simple terms, a cost function just measures the errors in our prediction.

Sometimes the Cost Function is also called the Loss Function.

In our case, the Cost Function is the MSE itself. You could use RMSE or MAE as the Cost Function too, but I am going with MSE because squaring amplifies the larger errors a bit.
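In code, the Cost Function is just the MSE written as a function of β0 and β1 (a minimal sketch; the data is the same toy set from earlier):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

def cost(b0, b1):
    """J(b0, b1): the MSE of the line y = b0 + b1*x against the data above."""
    n = len(x)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n

print(round(cost(0.55, 3), 2))  # 8.5
print(cost(1, 2))               # 0.0 -- the perfect fit y = 2x + 1
```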

Gradient Descent

Let’s take each formula as a mathematical function-

1. Function for predicted values.

Gradient Descent - Prediction Function
Gradient Descent – Prediction Function
(Image 8)

2. Function for errors.

Gradient Descent - MSE Function
Gradient Descent – MSE Function (Image 9)

3. Function for Cost Function.

Gradient Descent - Cost Function
Gradient Descent – Cost Function (Image 10)

For the highest accuracy possible we want J(β0, β1) ≈ 0.

Phew, that was a lot, but we are halfway there now!

Now let’s plot the errors we get for different values of β0 and β1.

Gradient Descent - Slope vs MSE
Gradient Descent – Slope vs MSE (Click here to view the graph on Chart Studio) (Image 11)
Gradient Descent - Slope vs MSE
Gradient Descent – Slope vs MSE (Click here to view the graph on Chart Studio) (Image 12)

From the graphs above we can see that the Slope vs MSE curve and the Intercept vs MSE curve are parabolas, which for certain values of slope and intercept come very close to 0.
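You can reproduce that parabola shape numerically by sweeping one coefficient while holding the other fixed (a rough sketch; the Plotly plotting is omitted, and the sweep range is my own choice):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

def mse(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

# Hold the intercept at its ideal value (1) and sweep the slope
slopes = [i / 10 for i in range(0, 41)]   # 0.0, 0.1, ..., 4.0
costs = [mse(1, b1) for b1 in slopes]

# The bottom of the parabola sits at the true slope, 2.0
best = slopes[costs.index(min(costs))]
print(best)  # 2.0
```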

Let’s plot a 3D graph for an even clearer picture.

Gradient Descent - 3D Graph of Slope vs Intercept vs MSE
Gradient Descent – 3D Graph of Slope vs Intercept vs MSE (Click here to view the graph on Chart Studio) (Image 13)

(That’s a cool-looking graph ngl!)

Here we can clearly see that for certain values of β0 and β1, the value of J(β0, β1) is at its minimum.

Those are the values we want for our linear equation as they would yield us the highest accuracy possible.

Convergence Theorem

Gradient Descent (Image 14)

As you can see in the image above, we start from a particular coefficient value and take small steps toward the minimum error value. The size of each small step is controlled by the learning rate.

Because we are taking small steps in order to converge to the minimum, this whole process is referred to as the convergence theorem.

The formula is something like this –

Gradient Descent - Convergence Theorem
Gradient Descent – Convergence Theorem (Image 15)


  1. β0 is the intercept of the linear equation.
  2. β1 is the slope of the linear equation.
  3. L is the learning rate.
  4. (∂J/∂β0) is the slope (derivative) of the function J(β0, β1) with respect to β0.
  5. (∂J/∂β1) is the slope (derivative) of the function J(β0, β1) with respect to β1.

With each iteration, we would be getting closer to the ideal value of β0 and β1.
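Before deriving the exact derivatives, one iteration of this update rule can be sketched by approximating the two slopes numerically with finite differences (a rough sketch; the learning rate of 0.01 is my own choice):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

def cost(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

def numerical_grad(b0, b1, h=1e-6):
    # Central-difference approximations of dJ/db0 and dJ/db1
    d_b0 = (cost(b0 + h, b1) - cost(b0 - h, b1)) / (2 * h)
    d_b1 = (cost(b0, b1 + h) - cost(b0, b1 - h)) / (2 * h)
    return d_b0, d_b1

L = 0.01            # learning rate: scales the size of each step
b0, b1 = 0.55, 3    # the starting guesses from earlier
d_b0, d_b1 = numerical_grad(b0, b1)

# One step of the update rule: move against the gradient
b0, b1 = b0 - L * d_b0, b1 - L * d_b1
print(cost(b0, b1) < cost(0.55, 3))  # True -- the error went down
```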

If you feel a bit confused about what this was, click here for a better understanding. In the video, a different Cost Function is used, but don’t worry about that. The Cost Function can vary from method to method; you just have to make sure that it accurately measures the error.

Okay, cool. Now we are one step closer to implementing gradient descent in code. All we need to do is figure out what the values of (∂J/∂β0) and (∂J/∂β1) are, and we are good to go.

Gradient Descent - Cost Function
Gradient Descent – Cost Function (Image 16)

First let’s find ∂J/∂β0,

Gradient Descent - ∂J/∂β0
Gradient Descent – ∂J/∂β0 (Image 17)

Next, let’s find ∂J/∂β1,

Gradient Descent - ∂J/∂β1
Gradient Descent – ∂J/∂β1 (Image 18)

If there were a β2 in the equation (which can happen in Multiple Linear Regression), then (∂J/∂β2) would be,

Gradient Descent - ∂J/∂β2
Gradient Descent – ∂J/∂β2 (Image 19)

For any βn in the equation, the ∂J/∂βn would be,


Gradient Descent - ∂J/∂βn
Gradient Descent – ∂J/∂βn (Image 20)

Now that we have the values of ∂J/∂β0 and ∂J/∂β1, we can simply plug them into the formula in Image 15.

With each iteration, we would be getting one step closer to the minimum Cost Function value.
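Putting it all together, a bare-bones gradient descent loop for Simple Linear Regression looks something like this (a minimal sketch of the idea; the learning rate and iteration count are my own choices, and the full version lives in the linked notebook):

```python
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
n = len(x)

b0, b1 = 0.0, 0.0   # start from arbitrary values
L = 0.01            # learning rate

for _ in range(10000):
    y_pred = [b0 + b1 * xi for xi in x]
    # Analytic derivatives of the MSE with respect to b0 and b1
    d_b0 = (-2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))
    d_b1 = (-2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))
    # The update rule from Image 15: step against each gradient
    b0, b1 = b0 - L * d_b0, b1 - L * d_b1

print(round(b0, 2), round(b1, 2))  # 1.0 2.0 -- the line y = 2x + 1
```

With each pass through the loop, the coefficients move a little closer to the values that minimize the Cost Function.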

To see how all this can be implemented in code, click here.

If you feel like you understood Gradient Descent, try deriving the formula for Multiple Linear Regression. It won’t be much different from what you did in this article, and it is a fun exercise to practice the concepts you just learned.

Sources –

  1. Image 13 – https://towardsdatascience.com/quick-guide-to-gradient-descent-and-its-variants-97a7afb33add
  2. All mathematical equations were written using this website.
  3. The graphs were plotted in Python using Plotly.
