Data preprocessing is the set of techniques used to convert raw data into clean data. In other words, data gathered from different sources arrives in a raw format that is not feasible for analysis, so certain steps are executed to turn the raw data into a clean dataset.
Importance of Data Pre-processing
Let’s take a simple example: a couple goes to a hospital for a pregnancy test, and both the man and the woman are tested. When the results come back, they suggest that the man is pregnant.
Pretty weird, right?
Now try to relate this to a machine learning problem: classification. We have pregnancy test data for 1000+ couples, and for 60% of the data we know who’s pregnant. For the remaining 40%, we need to predict the results based on the previously recorded tests. Let’s say that, within this labeled 60%, 1% of the records suggest that the man is pregnant.
While building a machine learning model, if we haven’t done any preprocessing, such as correcting outliers, handling missing values, normalizing and scaling the data, or feature engineering, we might fail to recognize that this 1% of results is false. (Don’t freak out if you don’t know what these techniques are; we’ll be covering them later.)
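To make two of the techniques above concrete, here is a minimal sketch of handling missing values and min-max scaling with pandas. The dataset and column name are made up for illustration; real pipelines would choose the fill strategy (median, mean, model-based imputation) based on the data.

```python
import pandas as pd

# Hypothetical toy dataset: ages in years, with one missing entry
df = pd.DataFrame({"age": [25.0, 32.0, None, 41.0]})

# Handle missing values: fill the gap with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Min-max scaling: rescale the column to the [0, 1] range
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
```

After these two steps the column has no gaps and every value lies in [0, 1], which many learning algorithms expect.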
A machine learning model is nothing but a piece of code; an engineer or data scientist makes it smart by training it with data. So if you give garbage to the model, you will get garbage in return: the trained model will make false or wrong predictions for the 40% of people whose results are unknown.
This is just one example of incorrect data. People might also end up collecting inappropriate values (e.g. a negative salary for an employee), or leaving values missing altogether. All of this can result in misleading predictions for the unknowns.
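A simple validity check catches values like the negative salary mentioned above. This sketch uses made-up numbers; in practice the valid range comes from domain knowledge, and flagged records are usually reviewed or imputed rather than silently dropped.

```python
# Hypothetical salary records; a negative salary cannot occur in reality
salaries = [52000, 61000, -4500, 58000]

# Keep only values inside the valid range, and flag the rest for review
clean = [s for s in salaries if s >= 0]
flagged = [s for s in salaries if s < 0]
```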
Steps for Data Preprocessing
In the modern era, most work relies on data, so large amounts of data are collected for many purposes: academic and scientific research, institutional use, personal and private use, commercial use, and more. But this real-world data tends to be incomplete, noisy, and inconsistent, which lowers the quality of the collected data and, in turn, the quality of models built on it. To address these issues, processing the collected data is essential, so that it goes through all the steps below and gets sorted, stored, filtered, presented in the required format, and analyzed.