A Simple Intuition for Overfitting, or Why Testing on Training Data is a Bad Idea

Last Updated on August 21, 2016

When you first start out with machine learning, you load a dataset and try models. You might ask yourself: why can't I just build a model with all of the data and evaluate it on that same dataset?

It seems reasonable. More data to train the model is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is, right?

Wrong.

In this post you will discover the difficulties with this reasoning and develop an intuition for why it is important to test a model on unseen data.

Train and Test on the Same Dataset

If you have a dataset, say the iris flower dataset, what is the best model of that dataset?

Irises
Photo by dottieg2007, some rights reserved

The best model is the dataset itself. If you take a given data instance and ask for its classification, you can look that instance up in the dataset and report the stored answer with perfect accuracy.
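A minimal sketch of this idea: a "model" that simply memorizes its training data as a lookup table. The tiny feature tuples and labels below are made-up stand-ins for something like the iris data, not values from the post.

```python
# A lookup-table "model" that memorizes its training data
# (illustrative feature tuple -> class label pairs).
train = [
    ((5.1, 3.5), "setosa"),
    ((7.0, 3.2), "versicolor"),
    ((6.3, 3.3), "virginica"),
]

lookup = {features: label for features, label in train}

def predict(features):
    # Perfect on anything it has seen; useless on anything it has not.
    return lookup.get(features, "unknown")

# Evaluating on the training data itself yields a flawless score...
accuracy = sum(predict(x) == y for x, y in train) / len(train)
print(accuracy)  # 1.0

# ...but an unseen instance exposes the model as pure memorization.
print(predict((4.9, 3.0)))  # "unknown"
```

The perfect training-set score tells you nothing about how the model handles new data, which is exactly why evaluation needs unseen examples.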