Model Selection using R-squared (R²) Measure

If you are looking for a widely used measure of how well a regression fits the data, R-squared will be your cup of tea.

The correlation coefficient R tells you how strongly two things are related. However, we tend to use R² because it’s easier to interpret. R² is the proportion of variation (ranging from 0 to 1, or 0% to 100%) explained by the relationship between two variables.

In a linear regression model, R-squared serves as an evaluation metric that measures the scatter of the data points around the fitted regression line.

An R-squared of zero means our regression line explains none of the variability of the data. An R-squared of 1 would mean our model explains the entire variability of the data.
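A minimal sketch of these two extremes in plain Python, using the standard definition R² = 1 − RSS/TSS. The helper name `r_squared` and the data values are made up for illustration. Predicting the mean for every point explains none of the variability, while predicting every point exactly explains all of it.

```python
def r_squared(actual, predicted):
    """Return R² = 1 - RSS/TSS for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # RSS: squared differences between actual and predicted values
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # TSS: squared differences between actual values and their mean
    tss = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - rss / tss


y = [2.0, 4.0, 6.0, 8.0]  # made-up observations
mean_y = sum(y) / len(y)

r2_mean_only = r_squared(y, [mean_y] * len(y))  # always predict the mean
r2_perfect = r_squared(y, y)                    # predict every point exactly

print(r2_mean_only, r2_perfect)  # 0.0 1.0
```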

Range of Regression-line

Formula: R-squared is calculated as follows.

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

Where RSS is the Residual Sum of Squares and TSS is the Total Sum of Squares.

The R-squared value is defined in terms of these two error terms.

1. Residual Sum of Squares (RSS): It is the sum, over all data points, of the squared difference between the actual value and the predicted value.

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

2. Total Sum of Squares (TSS): It is the sum, over all data points, of the squared difference between the actual output and the average value ȳ.

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The above is the simplified way to calculate the R-squared value. It uses both the residual sum of squares and the total sum of squares. The larger your R² is, the more of the variation in the observations your regression model explains, and the better it fits them.
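The calculation can be worked through by hand in a few lines of Python. This is only a sketch: the data points are invented, and the line is fitted with the ordinary least squares formulas for a single predictor. It computes RSS and TSS separately and then combines them as R² = 1 − RSS/TSS.

```python
# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares slope and intercept for one predictor
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

# Residual Sum of Squares: actual minus predicted, squared and summed
rss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Total Sum of Squares: actual minus mean, squared and summed
tss = sum((yi - mean_y) ** 2 for yi in y)

r2 = 1 - rss / tss
print(f"RSS={rss:.2f}  TSS={tss:.2f}  R²={r2:.2f}")  # RSS=2.40  TSS=6.00  R²=0.60
```

Here the fitted line explains about 60% of the variation in y. If scikit-learn is available, `sklearn.metrics.r2_score(y, predicted)` computes the same quantity.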

Plotting fitted values against observed values gives a visual demonstration of how R-squared values represent the scatter around the regression line.

Visual Representation of R-squared

As observed in the pictures above, the value of R-squared for the regression model on the left side is 17%, and for the model on the right is 83%. When a regression model accounts for more of the variance, the data points tend to fall closer to the fitted regression line.
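The same effect can be reproduced numerically. The sketch below is not the data behind the pictures: it uses invented points scattered around an assumed trend line y = 2x, once with small offsets and once with large ones, and scores both against that fixed line rather than a refitted one.

```python
def r_squared(actual, predicted):
    """Return R² = 1 - RSS/TSS for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    tss = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - rss / tss


x = [1, 2, 3, 4, 5, 6]
line = [2 * xi for xi in x]        # assumed trend line y = 2x
pattern = [1, -1, 1, -1, 1, -1]    # fixed offsets, alternating above and below

# Same trend, different amounts of scatter around the line
y_tight = [l + 0.5 * p for l, p in zip(line, pattern)]
y_loose = [l + 4.0 * p for l, p in zip(line, pattern)]

r2_tight = r_squared(y_tight, line)
r2_loose = r_squared(y_loose, line)

print(f"tight scatter: R²={r2_tight:.2f}")  # about 0.98
print(f"loose scatter: R²={r2_loose:.2f}")  # about 0.19
```

The points that sit close to the line give an R² near 1, while the widely scattered points give a much lower value, even though the underlying trend is identical.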

However, a regression model with an R² of 100% is an ideal scenario that is rarely, if ever, achieved with real data. In such a case, the predicted values equal the observed values, so all the data points fall exactly on the regression line.

Further Reads:

[1] Refer to this in-depth article on linear regression and the evaluation measure R-squared (R²) by Ajitesh Kumar, who explains R² with a practical Python scikit-learn implementation.

[2] To learn more about the R² statistic, please refer to this article from Analytics Vidhya.
