So far, we have seen how we can use deletion methods and imputation methods to handle missing values in a dataset. These univariate methods used for missing value imputation are simplistic ways of estimating the value and may not always provide an accurate picture.
For example, suppose we have variables for the density of cars on the road and the level of pollutants in the air, and a few observations are missing for the level of pollutants. Imputing them with the mean or median level of pollutants may not be an appropriate strategy, because it ignores how pollutant levels vary with car density.
In such scenarios, algorithmic or model-based approaches can help to impute the missing values. We can use a model to predict the missing values from the other variables. This requires a model to be created for each input variable that has missing values. Any one of a range of different models can be used to predict the missing value.
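As a minimal sketch of the idea, the snippet below fits a straight line to the rows where the pollutant level is observed and uses it to predict the missing one. The numbers, the variable names, and the choice of simple linear regression are all assumptions made for illustration; any predictive model could take its place.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up data: the last pollutant reading is missing (None).
car_density = [10, 20, 30, 40, 50]
pollutant = [12.0, 22.0, 32.0, 42.0, None]

# Train only on rows where the target is observed.
observed = [(x, y) for x, y in zip(car_density, pollutant) if y is not None]
slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])

# Predict the missing entries from car density.
imputed_by_model = [
    y if y is not None else slope * x + intercept
    for x, y in zip(car_density, pollutant)
]
imputed_by_mean = mean(y for _, y in observed)

print(imputed_by_model[-1])  # 52.0, follows the trend with car density
print(imputed_by_mean)       # 27.0, what mean imputation would have used
```

In this toy data the busiest road gets a high pollutant estimate from the model, whereas mean imputation would assign it an average value.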
a. Nearest Neighbor Imputation:
- In this method, nearest neighbors are chosen based on some distance measure and their average is used as an imputation estimate.
- KNN can predict both discrete attributes (using the most frequent value among the k nearest neighbors) and continuous attributes (using the mean among the k nearest neighbors).
Suppose you run out of necessary food items at home and, due to the lockdown, none of the nearby stores is open. You ask your closest neighbour for help and end up cooking whatever they supply. This is an example of imputation from the 1-nearest neighbour.
If you instead ask your 3 nearest neighbours for help and combine the items they supply, that is an example of imputation from the 3-nearest neighbours.
Similarly, missing values in datasets can be imputed with the help of values of observations from the k-Nearest Neighbours in your dataset. Neighbouring points for a dataset are identified by certain distance metrics. The idea in kNN methods is to identify ‘k’ samples in the dataset that are similar or close in space. Then we use these ‘k’ samples to estimate the value of the missing data points. Each sample’s missing values are imputed using the mean value of the ‘k’-neighbours found in the dataset.
Consider the above diagram that represents the working of kNN. In this case, the oval area contains the neighbouring points of the green square data point. We use a measure of distance to identify the neighbours.
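The procedure can be sketched in a few lines of plain Python. This is a simplified illustration, not a library implementation: the function name `knn_impute` and the sample data are hypothetical, distances are Euclidean over the columns both rows have observed, and only numeric columns are handled.

```python
import math

def knn_impute(rows, k=3):
    """Fill each None with the mean of that column over the k nearest rows."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            neighbours = [v for _, v in candidates[:k]]
            if neighbours:
                result[i][j] = sum(neighbours) / len(neighbours)
    return result

# Made-up data: the third row is missing its second value.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [4.0, 40.0], [10.0, 100.0]]
filled = knn_impute(rows, k=2)
print(filled[2])  # [3.0, 30.0], the mean of the two closest rows (20.0 and 40.0)
```

Note that the distant row `[10.0, 100.0]` has no influence on the estimate, which is the main difference from imputing with the column mean.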
The distance metric varies according to the type of data:
1. Continuous Data: The commonly used distance metrics for continuous data are Euclidean, Manhattan, and Cosine.
2. Categorical Data: Hamming distance is generally used in this case. It goes through all the categorical attributes and, for each one, counts one if the value differs between the two points. The Hamming distance is then equal to the number of attributes for which the values differ.
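For reference, the four metrics named above can be written as short functions. This is a plain-Python sketch with illustrative function names; cosine distance is taken here as one minus the cosine similarity, which is a common convention, and it is undefined for all-zero vectors.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of attributes whose values differ between two points."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))        # 5.0
print(manhattan([0, 0], [3, 4]))        # 7
print(cosine_distance([1, 0], [0, 1]))  # 1.0
print(hamming(["red", "suv", "petrol"], ["red", "sedan", "diesel"]))  # 2
```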
b. Random Forest Imputation:
KNN imputation is a machine-learning-based method that has seen success, but it requires tuning of the parameter k and inherits many of KNN’s weaknesses, such as sensitivity to outliers and noise.
MissForest is another machine learning-based data imputation algorithm that operates on the Random Forest algorithm.
First, the missing values are filled in using median/mode imputation. Then the rows that originally had a missing value are marked as ‘Predict’ and the remaining rows as training rows. The training rows are used to fit a Random Forest model that predicts, in this case, Age based on Score. The prediction generated for each ‘Predict’ row then replaces the initial fill, producing a transformed dataset.
This process of looping through the missing data points repeats several times, with each iteration working from progressively better data. It’s like standing on a pile of rocks while continually adding more to raise yourself: the model uses its current position to elevate itself further. In later iterations the model may adjust its predictions or keep them the same.
Iterations continue until a stopping criterion is met or a maximum number of iterations has elapsed. As a general rule, datasets become well imputed after four to five iterations, but this depends on the size of the dataset and the amount of missing data.
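The loop described above can be sketched as follows. This is a toy illustration rather than the MissForest implementation: to stay dependency-free it swaps in a very small forest of bootstrapped depth-1 regression trees for a full Random Forest, handles exactly two numeric columns (Age and Score, with made-up values), and stops on a simple change-below-tolerance rule. All function names and parameters here are hypothetical.

```python
import random
from statistics import mean, median

def fit_stump(xs, ys):
    """Depth-1 regression tree: one threshold on x, one mean per side."""
    overall = mean(ys)
    best = (float("inf"), None, overall, overall)
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if t is None or x < t else rm

def fit_forest(xs, ys, n_trees, rng):
    """Fit each stump on a bootstrap sample of the training rows."""
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    return mean(predict_stump(s, x) for s in forest)

def iterative_impute(rows, n_iter=5, tol=1e-6, n_trees=25, seed=0):
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]
    # Step 1: initial fill with the column median.
    for j in range(2):
        med = median(r[j] for r in rows if r[j] is not None)
        for r in filled:
            if r[j] is None:
                r[j] = med
    # Step 2: repeatedly re-predict the originally missing cells.
    for _ in range(n_iter):
        change = 0.0
        for j in range(2):
            other = 1 - j  # the single predictor column
            train = [i for i in range(len(rows)) if not missing[i][j]]
            forest = fit_forest([filled[i][other] for i in train],
                                [filled[i][j] for i in train], n_trees, rng)
            for i in range(len(rows)):
                if missing[i][j]:
                    new = predict_forest(forest, filled[i][other])
                    change += (new - filled[i][j]) ** 2
                    filled[i][j] = new
        if change < tol:  # assumed stopping rule
            break
    return filled

# Columns: [Age, Score]; None marks a missing value.
data = [[25, 50], [30, 60], [None, 70], [40, None], [45, 90], [50, 100]]
imputed = iterative_impute(data)
print(imputed[2], imputed[3])
```

Because each tree predicts an average of observed training values, the imputed entries always stay within the observed range of their column.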