Besides duplicate features, a dataset can also include correlated features.
“Correlation is defined as a measure of the linear relationship between two quantitative variables.”
A high correlation is often a useful property—if two variables are highly correlated:
- We can predict one from the other. Therefore, we generally look for features that are highly correlated with the target, especially for linear machine learning models.
- They provide redundant information with regard to the target. Essentially, we can make an accurate prediction of the target with just one of the redundant variables.
In these cases, the second variable doesn’t add additional information, so removing it can help to reduce the dimensionality and also the added noise.
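As a rough illustration of removing a redundant variable, the sketch below drops one feature from every highly correlated pair. The helper name `drop_correlated`, the 0.8 threshold, and the toy columns `x1`, `x2`, `x3` are assumptions made for this example, not part of any library or of the original text.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute Pearson r exceeds threshold.

    Hypothetical helper: the threshold value is an assumption, tune it per dataset.
    """
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so every pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up for illustration): x2 is almost a rescaled copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame(
    {
        "x1": x1,
        "x2": 2 * x1 + rng.normal(scale=0.01, size=200),
        "x3": rng.normal(size=200),  # generated independently of x1
    }
)

reduced = drop_correlated(df)
print(list(reduced.columns))  # ['x1', 'x3']
```

Which member of a correlated pair is kept depends only on column order here. In practice you might instead keep the feature that is more strongly correlated with the target.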
There are several methods to measure the correlation between variables. Let’s explore the most widely used one, the Pearson correlation.
Correlation between sets of data is a measure of how closely they are related. The most common measure of correlation in statistics is the Pearson correlation, in full the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data.
In simple terms, it answers the question: “Can I draw a line graph to represent the data?”
Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r” for a sample.
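Pearson Correlation Formula
$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$
where:
- N = the number of pairs of scores
- Σxy = the sum of the products of paired scores
- Σx = the sum of x scores
- Σy = the sum of y scores
- Σx² = the sum of squared x scores
- Σy² = the sum of squared y scores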
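To make the sums defined above concrete, here is a minimal sketch that computes the sample coefficient r directly from N, Σx, Σy, Σxy, Σx² and Σy². The function name `pearson_r` and the five data points are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation computed from the raw sums (illustrative sketch)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")

    sum_x = sum(xs)                              # Σx
    sum_y = sum(ys)                              # Σy
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # Σxy
    sum_x2 = sum(x * x for x in xs)              # Σx²
    sum_y2 = sum(y * y for y in ys)              # Σy²

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / denominator


# Made-up paired scores.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
```

For these points the sums are N = 5, Σx = 15, Σy = 20, Σxy = 66, Σx² = 55 and Σy² = 86, so r = 30 / √1500 ≈ 0.7746.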
The coefficient summarizes the strength of the linear relationship between two variables and ranges from -1 to 1:
- 1: a perfect positive correlation; the values of one variable increase as the values of the other increase.
- -1: a perfect negative correlation; the values of one variable decrease as the values of the other increase.
- 0: no linear correlation between the two variables.
Some everyday examples of correlations of different strengths:
- Large positive correlation. Example: As children grow, so do their clothes and shoe sizes.
- Medium positive correlation. Example: As the number of automobiles increases, so does the demand for fuel.
- Small negative correlation. Example: The more somebody eats, the less hungry they get.
- Weak / no correlation. Example: Fuel prices and the number of people adopting pets have little or nothing to do with each other.
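The ends and the middle of this range can be reproduced with a few synthetic variables. The sketch below uses NumPy's `np.corrcoef`; the data is made up. Note the last case: a coefficient of 0 only rules out a linear relationship, so two variables can be perfectly dependent and still have r = 0.

```python
import numpy as np

x = np.arange(-5, 6)  # -5, -4, ..., 5 (symmetric around zero)

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y).
r_pos = np.corrcoef(x, 3 * x + 2)[0, 1]     # increasing straight line
r_neg = np.corrcoef(x, -0.5 * x + 1)[0, 1]  # decreasing straight line
r_quad = np.corrcoef(x, x ** 2)[0, 1]       # perfect, but non-linear, dependence

print(f"increasing line: r = {r_pos:.2f}")
print(f"decreasing line: r = {r_neg:.2f}")
print(f"parabola:        r = {abs(r_quad):.2f}")
```

The two straight lines give coefficients of 1 and -1 (up to floating-point rounding), while the parabola gives 0 because its rising and falling halves cancel out.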
What does the Pearson correlation coefficient test do?
The Pearson correlation coefficient is an important statistic. It looks at the relationship between two variables by seeking to draw a line of best fit through their data. The sign of the coefficient shows whether this linear relationship is positive or negative.
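The link between the coefficient and a line drawn through the data can be checked numerically. In this sketch (made-up data, with NumPy's `np.polyfit` standing in for the line-drawing step), the slope of the least-squares line equals r scaled by the ratio of the standard deviations of y and x, so the slope and r always share the same sign.

```python
import numpy as np

# Made-up paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares straight line y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient of the same data.
r = np.corrcoef(x, y)[0, 1]

# For a least-squares line: slope = r * (std of y / std of x).
slope_from_r = r * y.std() / x.std()

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.4f}")
print(f"slope recovered from r = {slope_from_r:.3f}")
```

Here the fitted line is approximately y = 0.6x + 2.2 and r ≈ 0.7746: a positive slope paired with a positive coefficient.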