Handling Missing Data

Missing data are values that are absent from a dataset but would have been meaningful for a machine learning project had they been observed. In this article, we’ll see that missing data can take many forms: a gap in a sequence, an incomplete feature, missing files, incomplete information, data entry errors, and so on. Most real-world datasets contain missing data of this kind. Please jump to this article if you want to go straight to handling missing data case by case.

Before you start cleaning a dataset, take a quick look at the data. The dataset used here is small, but it highlights many real-world situations that you will encounter.

Example of missing data in a table

1. Standard Missing Values

Going back to the dataset above, let’s take a look at the columns and rows highlighted in blue:

– In the “Street Number” column, the third row has an empty cell and the seventh row has an “NA” value.

– In the “own_occupied” column, the seventh row has an empty cell.

These are missing values.
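As a minimal sketch of how such standard missing values can be detected, the snippet below uses pandas. The original table is not reproduced in the text, so the CSV content and the column names `street_number` and `own_occupied` are hypothetical stand-ins, arranged so that the empty cells and the “NA” value fall in the rows described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the table above; the values are invented
# for illustration only.
csv_text = """street_number,own_occupied
104,Y
197,N
,N
201,Y
203,Y
207,Y
NA,
"""

df = pd.read_csv(io.StringIO(csv_text))

# By default, read_csv parses both empty cells and the string "NA" as NaN.
street_missing = df["street_number"].isnull()
print(street_missing)

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

Because pandas treats empty cells and “NA” as missing by default, no extra configuration is needed for this case.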

2. Non-Standard Missing Values

Sometimes missing values appear in different formats. This is a common problem when multiple users enter data manually. Maybe I like to use “n/a” but you like to use “na”.

Let’s take a look at the “Number of Bedrooms” column, where the relevant rows are highlighted in yellow.

In this column, there are four missing values, entered in formats such as: missing_values = ["n/a", "na", "–"]

It’s important to recognize these non-standard types of missing values so that they can be summarized and transformed correctly. If you count the missing values before converting these non-standard types, you could end up overlooking many of them.
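One way to handle this in pandas, sketched under assumptions, is to pass the list of markers to the `na_values` parameter of `read_csv`. The column name `num_bedrooms` and the sample values are hypothetical; the markers are the ones listed in the text and must match the strings in the file exactly.

```python
import io

import pandas as pd

# Hypothetical bedroom counts containing non-standard missing markers.
csv_text = """num_bedrooms
3
n/a
na
2
–
"""

# Naive read: pandas recognizes "n/a" by default, but not "na" or "–".
raw = pd.read_csv(io.StringIO(csv_text))
naive_count = raw["num_bedrooms"].isnull().sum()

# Tell pandas which extra strings should be treated as missing.
missing_values = ["n/a", "na", "–"]
clean = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)
full_count = clean["num_bedrooms"].isnull().sum()

print("missing before conversion:", naive_count)
print("missing after conversion:", full_count)
```

Comparing the two counts shows how many missing values would be overlooked if the non-standard markers were left as ordinary strings.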

3. Unexpected Missing Values

So far we’ve seen standard missing values and non-standard missing values. What if we have an unexpected type?

For example, if our feature is expected to be a string, but there’s a numeric type, then technically this is also a missing value.

Let’s take a look at the “Owner Occupied” column to see what I’m talking about.

In the fourth row, highlighted with pink, there’s the number 12. The response for Owner Occupied should be a string (Y or N), so this numeric type is also a missing value.
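A possible way to catch this kind of value, offered as a sketch and not as the only approach, is to find entries in the string column that parse as numbers and replace them with NaN. The column name `own_occupied` and the sample rows are hypothetical, and the rule “anything numeric is invalid here” is an assumption that suits a Y/N column.

```python
import io

import pandas as pd

# Hypothetical owner-occupied responses; the fourth row holds a number.
csv_text = """own_occupied
Y
N
N
12
Y
"""

df = pd.read_csv(io.StringIO(csv_text))

# Entries that can be parsed as numbers do not belong in a Y/N column.
# errors="coerce" turns non-numeric entries into NaN, so notna() flags
# exactly the entries that are numeric.
is_numeric = pd.to_numeric(df["own_occupied"], errors="coerce").notna()

# Series.mask replaces the flagged entries with NaN.
df["own_occupied"] = df["own_occupied"].mask(is_numeric)

print(df["own_occupied"].isnull())
```

After this step the unexpected value is counted as missing alongside the standard and non-standard cases.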
