Tour of Data Sampling Methods for Imbalanced Classification

Machine learning techniques often fail or give misleadingly optimistic performance on classification datasets with an imbalanced class distribution.

The reason is that many machine learning algorithms are designed to operate on classification data with an equal number of observations for each class. When this is not the case, algorithms can learn that the few minority-class examples are unimportant and can be ignored while still achieving good overall performance.
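To see why this is misleading, consider a hypothetical dataset with a 99:1 class distribution. A naive model that always predicts the majority class achieves 99 percent accuracy while never detecting a single minority-class example:

```python
# Hypothetical imbalanced dataset: 990 majority-class (0) and 10 minority-class (1) labels.
y_true = [0] * 990 + [1] * 10

# A naive "model" that ignores the minority class and always predicts the majority.
y_pred = [0] * len(y_true)

# Accuracy looks excellent even though class 1 is never predicted.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
# 0.99
```

This is why accuracy alone is a poor guide on imbalanced problems, and why the training data itself may need to be rebalanced.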

Data sampling provides a collection of techniques that transform a training dataset in order to balance or better balance the class distribution. Once balanced, standard machine learning algorithms can be trained directly on the transformed dataset without any modification. This allows the challenge of imbalanced classification, even with severely imbalanced class distributions, to be addressed with a data preparation method.
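As a minimal sketch of the idea, the following standard-library-only function implements random oversampling, one of the simplest data sampling methods: minority-class rows are duplicated at random until every class matches the size of the largest class. The function name and toy data are illustrative; libraries such as imbalanced-learn provide production implementations of this and many other methods.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    rng = random.Random(seed)
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        # indices of all rows belonging to this class
        idx = [i for i, lab in enumerate(y) if lab == label]
        # duplicate random rows until this class reaches the target size
        for _ in range(target - count):
            i = rng.choice(idx)
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

# Toy dataset: six majority-class rows, two minority-class rows.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
# Counter({0: 6, 1: 6})
```

After resampling, any standard classifier can be fit on `X_bal, y_bal` without modification, which is the appeal of treating imbalance as a data preparation step.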

There are many different types of data sampling methods that can be used, and there is no single best method to use on all classification problems and with all classification models. Like choosing a predictive model, careful experimentation is required to discover what works best for your project.

In this tutorial, you will discover a suite of data sampling techniques that can be used to balance an imbalanced classification dataset.
