Random Oversampling and Undersampling for Imbalanced Classification

Last Updated on August 28, 2020

Imbalanced datasets are those where there is a severe skew in the class distribution, such as a ratio of 1:100 or 1:1000 between examples in the minority class and examples in the majority class.

This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem as it is typically the minority class on which predictions are most important.

One approach to addressing the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are to delete examples from the majority class, called undersampling, and to duplicate examples from the minority class, called oversampling.
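To make the two approaches concrete, here is a minimal sketch using only the Python standard library. The function names `random_oversample` and `random_undersample`, the `seed` parameter, and the toy 1:100 dataset are assumptions made for illustration; they are not taken from any particular library, and a real project would more likely rely on a dedicated resampling package.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sample WITH replacement, so the same example can be copied many times.
        extra = rng.choices(pool, k=target - n)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so every class is as small as the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sample WITHOUT replacement; everything not selected is deleted.
        kept = rng.sample(pool, k=target)
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy dataset with a 1:100 class ratio (10 positive, 1000 negative).
X = [[i] for i in range(1010)]
y = [1] * 10 + [0] * 1000

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print("original:    ", sorted(Counter(y).items()))
print("oversampled: ", sorted(Counter(y_over).items()))
print("undersampled:", sorted(Counter(y_under).items()))
```

In this sketch both functions balance the classes exactly (a 1:1 ratio). That is a design choice for simplicity; a less aggressive target ratio is equally possible. Resampling is applied to the training dataset only, as described above.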

In this tutorial, you will discover random oversampling and undersampling for imbalanced classification.

After completing this tutorial, you will know:

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.
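The trade-offs in the last two points can be illustrated with a small, hedged sketch. The example names (`pos_0`, `neg_0`, and so on) and the 10 versus 1000 class sizes are made up for illustration; only the standard library is used.

```python
import random

rng = random.Random(42)

# Hypothetical 1:100 dataset: 10 minority and 1000 majority examples.
minority = [f"pos_{i}" for i in range(10)]
majority = [f"neg_{i}" for i in range(1000)]

# Oversampling: add copies drawn with replacement until the classes match in size.
oversampled = minority + rng.choices(minority, k=len(majority) - len(minority))
unique_after = len(set(oversampled))
avg_copies = len(oversampled) / unique_after

# Undersampling: keep a random subset of the majority class, delete the rest.
kept = rng.sample(majority, k=len(minority))
discarded_fraction = 1 - len(kept) / len(majority)

print(f"minority size after oversampling: {len(oversampled)}")
print(f"distinct minority examples:       {unique_after}")
print(f"average copies per example:       {avg_copies:.0f}")
print(f"majority examples discarded:      {discarded_fraction:.0%}")
```

With these sizes, oversampling leaves only 10 distinct minority examples, each seen about 100 times on average, which is one way a model can end up overfitting to them. Undersampling to a 1:1 ratio discards 99% of the majority class, and any useful information those examples carried goes with them.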
