Handling Missing Data
Missing values in the dataset must be handled before you start any statistical analysis or build a machine learning model.
Let’s look at some techniques for treating missing values with the help of an example. The two tables below lead to different insights.
The table on the left, which contains missing data, shows a lower count of Android Mobile users and iOS Tablet users and a higher Average Transaction Value than the table on the right, which has no missing data. Inferences drawn from the data with missing values could therefore adversely affect business decisions.
Let’s start with the deletion methods. We’ll cover the rest in upcoming articles.
1. Deletion Methods
Unless the missing data is ‘Missing Completely at Random’ (MCAR), deletion is best avoided in many cases. When the data is MCAR and deletion is appropriate, we can delete data either listwise or pairwise.
In listwise deletion, every row that contains a missing value is deleted. Here, the entire observations for User A and User C will be ignored.
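A minimal sketch of listwise deletion in plain Python may make this concrete. The records below are hypothetical (the original tables are not reproduced here); they only assume that User A and User C each have one missing field, marked with None. The helper name listwise_delete is also an illustrative choice.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"User": "A", "Device": None,     "OS": "Android", "Transaction": 100},
    {"User": "B", "Device": "Mobile", "OS": "iOS",     "Transaction": 200},
    {"User": "C", "Device": "Tablet", "OS": None,      "Transaction": 600},
    {"User": "D", "Device": "Mobile", "OS": "Android", "Transaction": 400},
]


def listwise_delete(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


complete_cases = listwise_delete(records)
kept_users = [row["User"] for row in complete_cases]
print(kept_users)  # ['B', 'D']
```

If the data lives in a pandas DataFrame, calling DataFrame.dropna() with its default arguments performs the same row-wise filtering.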
In pairwise deletion, only the missing values are ignored, and each analysis is done on the variables that are present. In the above case, two separate samples will be analyzed: one with the combination of User, Device, and Transaction, and the other with the combination of User, OS, and Transaction.
With this approach, no observation is deleted outright. Each sample simply leaves out the variable that has the missing value in it.
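Pairwise deletion can be sketched in the same way. Again, the records are hypothetical and the helper available_cases is an illustrative name, not part of any library. Each analysis keeps the rows that are complete for the columns it actually uses.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"User": "A", "Device": None,     "OS": "Android", "Transaction": 100},
    {"User": "B", "Device": "Mobile", "OS": "iOS",     "Transaction": 200},
    {"User": "C", "Device": "Tablet", "OS": None,      "Transaction": 600},
    {"User": "D", "Device": "Mobile", "OS": "Android", "Transaction": 400},
]


def available_cases(rows, columns):
    """Project rows onto `columns`, keeping rows that are complete in those columns."""
    return [
        {col: row[col] for col in columns}
        for row in rows
        if all(row[col] is not None for col in columns)
    ]


# One sample per analysis, each built from the variables that analysis needs.
device_sample = available_cases(records, ["User", "Device", "Transaction"])
os_sample = available_cases(records, ["User", "OS", "Transaction"])

print([row["User"] for row in device_sample])  # ['B', 'C', 'D']
print([row["User"] for row in os_sample])      # ['A', 'B', 'D']
```

Every user appears in at least one of the two samples, so no observation is discarded entirely. In pandas, DataFrame.dropna(subset=[...]) restricts the missing-value check to chosen columns in a similar way.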
Both of the above methods suffer from loss of information, and listwise deletion loses more than pairwise deletion. The problem with pairwise deletion, however, is that even though it uses all the available cases, the resulting analyses can’t be compared with one another because the sample is different every time.
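The trade-off can be illustrated with a rough comparison of how many rows each approach uses and what average transaction it reports. The data and the analysis labels are hypothetical, chosen only to show the effect.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"User": "A", "Device": None,     "OS": "Android", "Transaction": 100},
    {"User": "B", "Device": "Mobile", "OS": "iOS",     "Transaction": 200},
    {"User": "C", "Device": "Tablet", "OS": None,      "Transaction": 600},
    {"User": "D", "Device": "Mobile", "OS": "Android", "Transaction": 400},
]


def complete_on(rows, columns):
    """Return the rows that have no missing value in the given columns."""
    return [row for row in rows if all(row[col] is not None for col in columns)]


# Columns each analysis requires to be present.
analyses = {
    "listwise (all columns)": ["User", "Device", "OS", "Transaction"],
    "pairwise (Device)": ["User", "Device", "Transaction"],
    "pairwise (OS)": ["User", "OS", "Transaction"],
}

summary = {}
for name, columns in analyses.items():
    sample = complete_on(records, columns)
    # Store (rows used, average transaction) for this analysis.
    summary[name] = (len(sample), mean(row["Transaction"] for row in sample))
    print(name, summary[name])
```

With these made-up numbers, listwise deletion works from 2 rows while each pairwise analysis works from 3, and the three analyses report different averages (300, 400, and roughly 233) because each one rests on a different set of users.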