Z-score normalization is a popular and commonly used feature scaling technique. Before diving into it, let's first understand what feature scaling is and why it is needed in the first place.
- Feature scaling is one of the most important data preprocessing steps in machine learning. Because different features in a dataset can have very different ranges, leaving them unscaled lets numerically larger features dominate for algorithms that compute distances between data points.
- Algorithms that do not compare values across features, such as tree-based algorithms, are fairly insensitive to the scale of the features.
- Also, machine learning and deep learning models tend to train and converge more quickly when features are scaled.
Normalization and standardization are two of the most popular, and most frequently confused, feature scaling techniques. Here, we discuss the Z-score normalization (standardization) technique.
What is Z-score normalization?
- The letter 'Z' in Z-score refers to the standard normal distribution, often called the Z distribution. (It should not be confused with Altman's Z-score, a separate financial model developed by Edward Altman to estimate the chances of a public company going bankrupt.)
- Also referred to as zero-mean normalization, the Z-score helps normalize data into a simpler, more comparable form that is easier for us to interpret. Unlike range-based scaling, it does not squeeze values into a fixed interval, which makes it more robust to outliers.
- In this technique, the values of a feature are normalized using the mean and standard deviation of the data. The essence of the technique is to transform the values of different features to a common scale where the mean is zero and the standard deviation is one. This way, all the features are brought to the same scale.
- Technically, a z-score measures how many standard deviations a value lies above or below the mean. Because standardization does not confine transformed values to a predefined range, it is less affected by outliers than range-based methods such as min-max scaling.
- We use the following formula to perform a Z-score normalization on a feature in our dataset:

  z = (x – μ) / σ

  where:
- x: Original value
- μ: Mean of data
- σ: Standard deviation of data
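The formula above can be sketched directly in Python with NumPy; the feature values here are illustrative, not from the article:

```python
import numpy as np

# Illustrative feature values (assumed data)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = x.mean()          # mean of the feature
sigma = x.std()        # population standard deviation (ddof=0)

z = (x - mu) / sigma   # z-score normalization

print(z.mean())        # mean of transformed data is ~0
print(z.std())         # standard deviation of transformed data is ~1
```

After the transformation, the feature has zero mean and unit standard deviation, regardless of its original scale.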
For a normal distribution, the empirical rule estimates that:
- 68% of the data points lie within ±1 standard deviation of the mean
- 95% of the data points lie within ±2 standard deviations of the mean
- 99.7% of the data points lie within ±3 standard deviations of the mean
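These three percentages can be checked against the standard normal CDF using SciPy (a quick verification sketch, not part of the normalization itself):

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean
# for a normal distribution: P(-k < Z < k) = cdf(k) - cdf(-k)
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
print(coverage)  # ~0.6827, ~0.9545, ~0.9973
```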
How to Interpret Z-scores
- A positive z-score means the data point lies above the mean
- A negative z-score means the data point lies below the mean
- A z-score close to 0 means the data point is close to the mean
- A data point can be considered unusual if its z-score is above 3 or below -3, since such values are extremely rare under a normal distribution
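The |z| > 3 rule of thumb can be used to flag unusual points. A minimal sketch with artificial data (the values below are made up for illustration):

```python
import numpy as np

# Assumed sample: 20 typical values plus one extreme value
data = np.concatenate([np.full(20, 50.0), [150.0]])

# Z-score each point, then flag those more than 3 standard
# deviations from the mean
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3]
print(outliers)  # the extreme value 150.0 is flagged
```

Note that the sample must be reasonably large for this to work: in a very small sample, no point can be more than about (n-1)/sqrt(n) standard deviations from the mean.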
Advantages of Z-score Normalization
- It allows a data scientist to understand the probability of a score occurring within the normal distribution of the data
- The Z-score enables us to compare two different scores that are from different normal distributions of the data
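The second advantage is easy to demonstrate with two hypothetical exams whose score distributions differ (the means, standard deviations, and scores below are assumptions for illustration):

```python
# Exam A: mean 70, std 10; a student scored 85
z_a = (85 - 70) / 10   # 1.5 standard deviations above A's mean

# Exam B: mean 85, std 5; a student scored 92
z_b = (92 - 85) / 5    # 1.4 standard deviations above B's mean

# Although 92 > 85 in raw terms, the score on exam A is
# relatively stronger within its own distribution
print(z_a > z_b)  # True
```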
Example: Calculating a Z-score
Suppose the scores for a certain exam are normally distributed with a mean of 80 and a standard deviation of 4. Find the Z-score for an exam score of 87.
We can use the following steps to calculate the z-score:
- The mean is μ = 80
- The standard deviation is σ = 4
- The individual value we’re interested in is X = 87
Thus, z = (X – μ) / σ = (87 – 80) /4 = 1.75.
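The same calculation in Python, using the values from the example:

```python
mu, sigma = 80, 4   # mean and standard deviation of the exam scores
x = 87              # the individual score we're interested in

z = (x - mu) / sigma
print(z)  # 1.75
```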
Using the SciPy library, we can calculate z-scores with the scipy.stats.zscore function:

scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
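A short usage sketch of this function; the score array is illustrative:

```python
import numpy as np
from scipy.stats import zscore

# Illustrative exam scores (assumed data)
scores = np.array([80.0, 84.0, 76.0, 87.0, 82.0])

# By default zscore uses axis=0 and ddof=0 (population std),
# matching the formula z = (x - mu) / sigma
z = zscore(scores)
print(z)
```

The result has zero mean and unit standard deviation; pass `ddof=1` instead if you want the sample standard deviation in the denominator.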
The following tutorials provide additional information on different normalization techniques.