1. Oversampling Techniques
Oversampling, also known as upsampling, works on the minority class: it adds copies of minority-class observations until the dataset is balanced.
One class is abundant (the majority) and the other is rare (the minority). Oversampling increases the number of rare samples to restore the balance. It can be a good choice when you don't have much data to work with, because no observations are thrown away.
Oversampling can be divided into two types: random oversampling and informative oversampling.
a. Random Oversampling:
- Randomly duplicate examples in the minority class.
- However, it may cause overfitting, because learning algorithms tend to focus on the replicated minority examples.
- This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model.
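Random oversampling as described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `random_oversample` and the toy data are made up for this example, and the sketch simply duplicates randomly chosen rows until every class is as large as the largest one.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement, so the same row may be copied more than once.
        for i in rng.choices(idx, k=target - count):
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy data (hypothetical): six majority rows (label 0) and two minority rows (label 1).
X = [[1.0], [1.2], [0.9], [1.1], [1.3], [0.8], [3.0], [3.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, both classes have six rows, and every added row is an exact copy of an existing minority row. In practice, the imbalanced-learn library provides `RandomOverSampler` for the same purpose.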
b. Informative Oversampling (SMOTE):
- SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for the class imbalance problem.
- It aims to balance the class distribution by increasing the number of minority class examples, but it creates new examples instead of replicating existing ones.
- SMOTE synthesizes new minority instances between existing minority instances: it generates virtual training records for the minority class by linear interpolation.
- The general idea of SMOTE is to generate synthetic data between each sample of the minority class and its “k” nearest neighbors. That is, for each sample of the minority class, its “k” nearest neighbors are located (by default k = 5); then, on the line segment between the sample and each of its neighbors, a new synthetic point is generated. The figure gives a visual description of the SMOTE implementation.
As the figure above shows, SMOTE is applied to x1 and its 3 nearest neighbors (x2, x3, and x4) to generate the synthetic points s1, s2, and s3.
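The interpolation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the full SMOTE algorithm: the helper `smote_for_point` and the coordinates of x1 to x4 are invented for this example, and the sketch creates exactly one synthetic point per neighbor, each of the form x + lam * (neighbor - x) with a random lam between 0 and 1.

```python
import math
import random

def smote_for_point(x, minority, k=5, seed=0):
    """Create one synthetic point between x and each of its k nearest minority neighbors."""
    rng = random.Random(seed)
    # Candidate neighbors: every minority point except x itself.
    others = [p for p in minority if p is not x]
    # Keep the k closest ones by Euclidean distance.
    neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
    synthetic = []
    for nn in neighbors:
        lam = rng.random()  # position along the segment from x to nn
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic

# Hypothetical 2-D minority samples standing in for x1..x4 in the figure.
x1, x2, x3, x4 = [1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]

s1, s2, s3 = smote_for_point(x1, [x1, x2, x3, x4], k=3)
print(s1, s2, s3)
```

Each synthetic point lies on the straight line between x1 and one of its neighbors, which is why SMOTE never produces values outside the region already covered by the minority class. For real use, imbalanced-learn provides a `SMOTE` class with a `k_neighbors` parameter.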
Although SMOTE allows the generation of synthetic tabular data, the algorithm by itself has some limitations. SMOTE only works with continuous data (that is, it is not designed to generate categorical synthetic data).
In addition, each synthetic point is a linear combination of existing minority points, which can bias the generated data and consequently produce an overfitted model.
c. Ensembling with a Balanced Bagging Classifier:
The main objective of ensemble methodology is to improve the performance of single classifiers. The approach involves constructing several two-stage classifiers from the original data and then aggregating their predictions.
Bagging-based techniques for imbalanced data: A bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their predictions (either by voting or by averaging) to form a final prediction.
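The two aggregation rules mentioned here, voting and averaging, can be illustrated with a short sketch. The function names and the prediction values below are hypothetical; in a real ensemble the rows would come from the fitted base classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[i][j] is classifier i's label for sample j; return one label per sample."""
    # Ties are resolved in favor of the label that appears first in the column.
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

def average_scores(scores):
    """scores[i][j] is classifier i's score for sample j; return the mean score per sample."""
    return [sum(column) / len(column) for column in zip(*scores)]

# Hypothetical outputs of three base classifiers.
labels = [[0, 1, 1],
          [0, 0, 1],
          [1, 1, 1]]
scores = [[0.2, 0.9],
          [0.4, 0.7],
          [0.3, 0.8]]

print(majority_vote(labels))
print(average_scores(scores))
```

Voting is the usual choice when base classifiers output class labels, while averaging applies when they output probabilities or other numeric scores.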
Bagging refers to the method of randomly sampling training instances with replacement. In statistics, sampling with replacement is also called bootstrapping.
The term “with replacement” means that after an instance is drawn at random from the training set, it is put back before the next draw. As a result, the next instance selected may be the same as one selected earlier.
Here is a simple example of bagging:
As you can see, the same instance can appear multiple times in the subsample. This is the characteristic of the bagging method.
During bagging, each subsample is used to train one classifier. For each classifier, the samples that are not seen during training are called out-of-bag instances, or OOB instances.
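A bootstrap sample and its out-of-bag instances can be sketched with the standard library alone. The function name `bootstrap_sample` and the dataset size of 10 are assumptions made for this illustration.

```python
import random

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; return the in-bag draws and the out-of-bag indices."""
    rng = random.Random(seed)
    # Each draw picks from the full range again, so repeats are possible.
    in_bag = [rng.randrange(n) for _ in range(n)]
    # Out-of-bag instances are those that were never drawn.
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_sample(10)
print("in-bag indices:", in_bag)
print("out-of-bag indices:", oob)
```

Every index ends up either in the bag (possibly several times) or out of it, never both. Repeating the call with a different `seed` for each classifier gives each one its own training subsample and its own OOB set.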
These OOB instances can be used to evaluate the performance of the classifiers, since they serve the same function as a test set: data that is not seen during training. To compute an OOB score for a bagging ensemble, we can use scikit-learn’s BaggingClassifier and set oob_score=True.
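A minimal sketch of the OOB evaluation with scikit-learn might look like this. The synthetic dataset, the decision-tree base classifier, and the parameter values are assumptions chosen for illustration; only `BaggingClassifier` and `oob_score=True` come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset: roughly 90% of samples in class 0, 10% in class 1.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

clf = BaggingClassifier(
    DecisionTreeClassifier(),  # base classifier fitted on each bootstrap sample
    n_estimators=50,           # number of bootstrap samples / classifiers
    bootstrap=True,            # sample training instances with replacement
    oob_score=True,            # score each instance using only classifiers that did not see it
    random_state=0,
)
clf.fit(X, y)

# Accuracy estimated on the out-of-bag instances, between 0 and 1.
print(clf.oob_score_)
```

The plain `BaggingClassifier` does not rebalance the classes inside each bootstrap sample. For imbalanced data, imbalanced-learn offers `BalancedBaggingClassifier`, which resamples each subset before fitting the base classifier.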
Bagging is used to reduce overfitting: it combines many base learners into a stronger learner that generates more accurate predictions. Unlike boosting, bagging allows replacement in the bootstrapped sample.