How to Handle Missing Data with Python

Last Updated on August 28, 2020

Real-world data often has missing values.

Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

  • How to mark invalid or corrupt values as missing in your dataset.
  • How to remove rows with missing data from your dataset.
  • How to impute missing values with mean values in your dataset.
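
As a rough preview of those three steps, here is a minimal sketch using Pandas and NumPy. The toy data, the column names (pressure, bmi, label), and the rule that a zero means "invalid reading" are all assumptions made up for illustration. They are not taken from the dataset used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: assume a 0 in "pressure" or "bmi" is an invalid
# reading rather than a real measurement.
df = pd.DataFrame({
    "pressure": [72.0, 0.0, 64.0, 80.0],
    "bmi": [33.6, 26.6, 0.0, 28.1],
    "label": [1, 0, 1, 0],
})

# 1. Mark invalid values as missing (NaN) in the affected columns only.
cols = ["pressure", "bmi"]
marked = df.copy()
marked[cols] = marked[cols].replace(0.0, np.nan)
missing_counts = marked.isnull().sum()
print(missing_counts)

# 2. Remove every row that contains at least one missing value.
dropped = marked.dropna()
print(dropped.shape)

# 3. Alternatively, impute each missing value with its column mean.
imputed = marked.fillna(marked.mean())
print(imputed)
```

Dropping rows is the simplest option but discards data, while mean imputation keeps every row at the cost of introducing estimated values. Which one is appropriate depends on the dataset and the model.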

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Note: The examples in this post assume that you have Python 3 with Pandas, NumPy and scikit-learn installed, specifically scikit-learn version 0.22 or higher. If you need help setting up your environment, see this tutorial.

  • Update Mar/2018: Changed link to dataset files.
  • Update Dec/2019: Updated link to dataset to GitHub version.
  • Update May/2020: Updated code examples for API changes. Added references.