Statistical Significance Tests for Comparing Machine Learning Algorithms

Last Updated on August 8, 2019

Comparing machine learning methods and selecting a final model is a common operation in applied machine learning.

Models are commonly evaluated using resampling methods such as k-fold cross-validation, from which mean skill scores are calculated and compared directly. Although simple, this approach can be misleading, as it is hard to know whether a difference between mean skill scores is real or the result of a statistical fluke.
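For concreteness, the naive approach might look like the minimal sketch below, which assumes scikit-learn is available and uses a synthetic dataset and two arbitrary models purely for illustration:

```python
# A minimal sketch of the naive approach: compare mean k-fold scores directly.
# The dataset and models here are illustrative assumptions, not a recommendation.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification dataset
X, y = make_classification(n_samples=500, random_state=1)

# evaluate two models on the same 10-fold cross-validation splits
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores1 = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')
scores2 = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv, scoring='accuracy')

print('Model 1 mean accuracy: %.3f' % scores1.mean())
print('Model 2 mean accuracy: %.3f' % scores2.mean())
# A gap between these two means says nothing, on its own, about whether
# the difference is real or a statistical fluke.
```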

Statistical significance tests are designed to address this problem. They quantify the likelihood of observing the two samples of skill scores under the assumption that both were drawn from the same distribution. If this assumption, or null hypothesis, is rejected, it suggests that the difference in skill scores is statistically significant.
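As an illustration, one such test is the paired Student's t-test. The sketch below continues from the previous example, applying SciPy's ttest_rel to the two samples of per-fold scores; the choice of test here is illustrative only, since choosing an appropriate test is exactly the challenge this tutorial addresses:

```python
# Continuing the sketch above: apply a paired Student's t-test to the two
# samples of per-fold scores. Assumes scores1 and scores2 were produced by
# the previous snippet; the specific test is an illustrative choice.
from scipy.stats import ttest_rel

stat, p = ttest_rel(scores1, scores2)

# interpret the p-value at a 5% significance level
alpha = 0.05
if p > alpha:
    print('Fail to reject H0: the difference may be a fluke (p=%.3f)' % p)
else:
    print('Reject H0: the difference in skill is statistically significant (p=%.3f)' % p)
```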

Although not foolproof, statistical hypothesis testing can improve both your confidence in the interpretation of results during model selection and your presentation of those results.

In this tutorial, you will discover the importance and the challenge of selecting a statistical hypothesis test for comparing machine learning models.

After completing this tutorial, you will know: