Imbalanced dataset problem and some techniques to handle this issue - Part 1!!
Conventional model evaluation methods do not accurately measure model performance when faced with imbalanced datasets. An imbalance between positive and negative outcomes, the so-called class imbalance, is a problem commonly found in medical data. Despite various studies, class imbalance has remained a difficult issue.
Assume the negative class is labelled 0 and the positive class is labelled 1.
The overall performance of any model trained on such data will be constrained by its ability to predict rare points: the majority class gets more importance while the algorithm learns, and the minority class is largely ignored. So it becomes very important to address this problem before running any algorithm on such a dataset.
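To make this concrete, here is a minimal sketch (using a hypothetical synthetic dataset with roughly 1% positives, generated with scikit-learn) showing how a model that always predicts the majority class still reports ~99% accuracy while missing every positive example:

```python
# Minimal sketch: why plain accuracy misleads on an imbalanced dataset.
# Assumes a synthetic dataset with ~1% positives (class 1).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# A "model" that always predicts the majority (negative) class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))        # ~0.99, looks great
print("Recall (class 1):", recall_score(y_test, y_pred))  # 0.0, every positive missed
```

This is why metrics such as recall, precision, or F1 on the minority class tell a much more honest story than overall accuracy.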
One approach to addressing the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are (see the sketch after this list):
- delete examples from the majority class, called random undersampling
- duplicate examples from the minority class, called random oversampling
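As a quick illustration of both resampling approaches, here is a minimal sketch that assumes the third-party imbalanced-learn (imblearn) package is installed; the class counts before and after show how undersampling shrinks the majority class while oversampling duplicates minority examples:

```python
# Minimal sketch of random oversampling and undersampling,
# assuming the imbalanced-learn (imblearn) package is available.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000, weights=[0.99, 0.01], random_state=42
)
print("Original:", Counter(y))

# Random oversampling: duplicate minority-class examples until classes balance.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("After oversampling:", Counter(y_over))

# Random undersampling: delete majority-class examples until classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After undersampling:", Counter(y_under))
```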
When we have vastly more negative examples than positive ones, we can oversample. But this has two disadvantages:
- It produces duplicate data.
- It can lead to overfitting.
One efficient way of oversampling is SMOTE (Synthetic Minority Oversampling Technique).
SMOTE works by selecting minority-class examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line. In other words, SMOTE uses the nearest neighbours of minority-class examples to create synthetic data.
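Below is a minimal SMOTE sketch, again assuming imbalanced-learn is available; note that SMOTE should only be applied to the training split so synthetic points never leak into evaluation:

```python
# Minimal SMOTE sketch, assuming imbalanced-learn is installed.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# k_neighbors controls how many minority-class neighbours are used to
# interpolate each synthetic sample (5 is the library default).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", Counter(y_train))
print("After SMOTE:", Counter(y_res))
```

Because the new points are interpolated rather than copied, SMOTE avoids exact duplicates, though it can still blur class boundaries if the minority examples are noisy.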
Dealing with imbalanced datasets entails strategies such as improving classification algorithms or balancing classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm. The latter technique is preferred as it has wider application.
Having discussed some of these techniques in detail above, here is a summary of techniques for handling the imbalanced-dataset issue.
Some of the common techniques are:
- Collect more data
- Random Under-Sampling of the dominant class
- Random Over-Sampling of the non-dominant class
- Cluster-Based Over Sampling
- Algorithmic Ensemble Techniques
- Bagging Based techniques for imbalanced data
I will cover each topic in detail in upcoming posts. Keep checking!!!
I hope this post gives you a basic idea of the imbalanced-class issue and the names of techniques to handle it.
Keep Reading!!! Happy Learning!!