Statistical Imputation for Missing Values in Machine Learning

Last Updated on August 18, 2020

Datasets may have missing values, and this can cause problems for many machine learning algorithms.

As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A popular approach to data imputation is to calculate a statistic for each column (such as the mean) and replace every missing value in that column with that statistic. The approach is popular because the statistic is easy to calculate from the training dataset and because it often results in good performance.
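As a minimal sketch of this idea, the snippet below computes a per-column mean from a training array and uses it to fill the gaps in both the training and test arrays. The toy data and the helper name `impute` are hypothetical and chosen only for illustration; NumPy is assumed to be installed.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features.
# NaN marks a missing value.
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])
X_test = np.array([
    [np.nan, 40.0],
    [2.0, np.nan],
])

# One statistic per column, computed from the training data only
# and ignoring the NaN entries.
col_means = np.nanmean(X_train, axis=0)

def impute(X, fill_values):
    """Return a copy of X with each NaN replaced by its column's fill value."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X

# The same training statistics are applied to both splits.
X_train_imputed = impute(X_train, col_means)
X_test_imputed = impute(X_test, col_means)

print(col_means)
print(X_test_imputed)
```

Swapping `np.nanmean` for `np.nanmedian` gives median imputation with no other changes. The `SimpleImputer` class in scikit-learn follows the same pattern of fitting on the training data and transforming any later data.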

In this tutorial, you will discover how to use statistical imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

  • Missing values must be marked with NaN values and can be replaced with statistical measures calculated from the column of values.
  • How to load a CSV file with missing values, mark the missing values with NaN, and report the number and percentage of missing values for each column.
  • How to impute missing values with statistics as a data preparation method when
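
The loading and reporting step listed above can be sketched roughly as follows. The inline CSV text and the use of "?" as the missing-value marker are assumptions made for illustration, not details of any particular dataset; pandas is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content with no header row.
# Assumption: the file uses "?" to mark a missing value.
csv_text = """1,?,3.5
2,4,?
?,5,1.0
3,?,2.0
"""

# na_values tells pandas which strings to convert to NaN while parsing.
df = pd.read_csv(StringIO(csv_text), header=None, na_values="?")

# Count the NaN entries per column and express them as a percentage of rows.
n_missing = df.isna().sum()
pct_missing = n_missing / len(df) * 100

for col in df.columns:
    print(f"Column {col}: {n_missing[col]} missing ({pct_missing[col]:.1f}%)")
```

For a file on disk, a path would be passed to `pd.read_csv` in place of the `StringIO` object.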