How to Perform Data Cleaning for Machine Learning with Python

Last Updated on June 30, 2020

Data cleaning is a critically important step in any machine learning project.

In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform.

Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results.

In this tutorial, you will discover basic data cleaning you should always perform on your dataset.

After completing this tutorial, you will know:

  • How to identify and remove column variables that only have a single value.
  • How to identify and consider column variables with very few unique values.
  • How to identify and remove rows that contain duplicate observations.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.