How to Remove Outliers for Machine Learning

Last Updated on August 18, 2020

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem.

Sometimes a dataset contains extreme values that fall outside the expected range and are unlike the rest of the data. These are called outliers, and the skill of a machine learning model can often be improved by understanding and even removing them.

In this tutorial, you will discover outliers and how to identify and remove them from your machine learning dataset.

After completing this tutorial, you will know:

That an outlier is an unlikely observation in a dataset and may have one of many causes.
How to use simple univariate statistics like standard deviation and interquartile range to identify and remove outliers from a data sample.
How to use an outlier detection model to identify and remove rows from a training dataset in order to lift predictive modeling performance.
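
As a rough preview of the two univariate approaches named above, here is a minimal sketch using only the Python standard library. The helper names (`std_limits`, `iqr_limits`, `remove_outliers`), the synthetic Gaussian sample, and the cut-off multipliers (3 standard deviations, 1.5 times the IQR) are assumptions made for illustration; they are common conventions, not necessarily the exact choices used later in the tutorial. The model-based approach is not sketched here.

```python
import random
import statistics


def std_limits(data, k=3.0):
    """Return (lower, upper) limits at k standard deviations from the mean."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return mean - k * std, mean + k * std


def iqr_limits(data, k=1.5):
    """Return (lower, upper) limits at k times the IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(data, lower, upper):
    """Keep only the observations that fall inside [lower, upper]."""
    return [x for x in data if lower <= x <= upper]


# Synthetic sample (assumed for illustration): Gaussian values plus two
# hand-planted extreme observations.
random.seed(1)
sample = [random.gauss(50, 5) for _ in range(1000)] + [120.0, -30.0]

std_lower, std_upper = std_limits(sample)
kept_std = remove_outliers(sample, std_lower, std_upper)

iqr_lower, iqr_upper = iqr_limits(sample)
kept_iqr = remove_outliers(sample, iqr_lower, iqr_upper)

print("original:", len(sample))
print("kept (standard deviation):", len(kept_std))
print("kept (interquartile range):", len(kept_iqr))
```

The standard deviation rule assumes the data is roughly Gaussian, while the IQR rule makes no such assumption, which is why both are worth having on hand.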

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update May/2018: Fixed bug when filtering samples via outlier limits.
Update May/2020:
