Data Leakage in Machine Learning

Last Updated on August 15, 2020

Data leakage is a serious problem when developing predictive models in machine learning.

Data leakage is when information from outside the training dataset is used to create the model.
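One common way this can happen is during data preparation. The sketch below is a minimal illustration, not a definitive recipe: the toy `feature` values and the helper names `fit_scaler` and `standardize` are hypothetical and chosen only for this example. It contrasts standardization statistics computed on all rows (leaky) with statistics computed on the training rows only.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return the (mean, standard deviation) used to standardize a feature."""
    return mean(values), pstdev(values)

def standardize(values, scaler):
    """Rescale values with a previously fitted (mean, std) pair."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature: the first six rows are training data,
# the last two are held back to stand in for new, unseen data.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 50.0, 60.0]
train, test = feature[:6], feature[6:]

# Leaky: the statistics are computed on every row, so the held-out
# values influence how the training data is prepared.
leaky_scaler = fit_scaler(feature)

# Clean: the statistics come from the training rows only and are then
# reused, unchanged, on the held-out rows.
clean_scaler = fit_scaler(train)

train_scaled = standardize(train, clean_scaler)
test_scaled = standardize(test, clean_scaler)

print("leaky mean/std:", leaky_scaler)
print("clean mean/std:", clean_scaler)
```

The two scalers differ because the leaky one has already seen the held-out rows. Any model trained on data prepared with it has indirectly used information from outside the training dataset.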

In this post you will discover the problem of data leakage in predictive modeling.

After reading this post you will know:

  • What data leakage is in predictive modeling.
  • Signs of data leakage and why it is a problem.
  • Tips and tricks that you can use to minimize data leakage on your predictive modeling problems.


Let’s get started.


Goal of Predictive Modeling

The goal of predictive modeling is to develop a model that makes accurate predictions on new data, unseen during training.

This is a hard problem.

It’s hard because we cannot evaluate the model on something we don’t have.
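A common workaround is to hold back part of the data we do have and treat it as a stand-in for new data. The sketch below is one possible illustration under stated assumptions: the data is synthetic, the 75/25 split is arbitrary, and the helpers `fit_line` and `mean_absolute_error` are hypothetical names written for this example rather than calls from any library.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(model, xs, ys):
    """Average absolute difference between predictions and targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: y is roughly 2x + 1 plus Gaussian noise.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

# Shuffle row indices, then hold back the last 25% as stand-in "new" data.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(len(indices) * 0.75)
train_idx, test_idx = indices[:cut], indices[cut:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# The model only ever sees the training rows.
model = fit_line(train_x, train_y)

# The held-out error is the estimate of performance on unseen data.
print("train MAE:", mean_absolute_error(model, train_x, train_y))
print("held-out MAE:", mean_absolute_error(model, test_x, test_y))
```

The held-out error is only a trustworthy estimate if nothing about the held-out rows was used while building the model, which is exactly the condition that data leakage violates.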
