How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python

Last Updated on August 14, 2020 It is important to present both the expected skill of a machine learning model and confidence intervals for that model skill. Confidence intervals provide a range of model skills and a likelihood that the model skill will fall between the ranges when making predictions on new data. For example, a 95% likelihood of classification accuracy between 70% and 75%. A robust way to calculate confidence intervals for machine learning algorithms is to […]
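
As a rough illustration of the bootstrap idea named in the title, the sketch below resamples a set of per-example outcomes with replacement and reads a percentile interval off the resampled accuracies. The function name `bootstrap_ci`, the made-up `outcomes` list, and the choice of 1,000 resamples are assumptions for this example only; they are not taken from the post.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the mean of `scores` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Draw n items with replacement and record the resample mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the sorted means.
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

# Made-up data: 1 = correct prediction, 0 = incorrect (72% accuracy).
outcomes = [1] * 72 + [0] * 28
lower, upper = bootstrap_ci(outcomes)
print(f"95% interval for accuracy: {lower:.2f} to {upper:.2f}")
```

The interval depends on the seed and the number of resamples, so the printed bounds should be read as approximate.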

Introduction to Random Number Generators for Machine Learning in Python

Last Updated on July 31, 2020 Randomness is a big part of machine learning. Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions. In order to understand the need for statistical methods in machine learning, you must understand the source of randomness in machine learning. The source of randomness in machine learning is a mathematical trick called a pseudorandom number generator. […]
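
A minimal sketch of what "pseudorandom" means in practice, using Python's standard `random` module: a generator seeded with the same value reproduces the same sequence. The helper name `draw` and the seed values are illustrative assumptions.

```python
import random

def draw(seed, n=5):
    """Return n pseudorandom floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator with its own state
    return [rng.random() for _ in range(n)]

first = draw(seed=1)
second = draw(seed=1)  # same seed as `first`
third = draw(seed=2)   # different seed

print(first == second)  # the same seed reproduces the same sequence
print(first == third)   # a different seed gives a different sequence
```

Fixing the seed is what makes experiments that rely on randomness repeatable from run to run.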

How to Calculate Correlation Between Variables in Python

Last Updated on August 20, 2020 There may be complex and unknown relationships between the variables in your dataset. It is important to discover and quantify the degree to which variables in your dataset are dependent upon each other. This knowledge can help you better prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance will degrade with the presence of these interdependencies. In this tutorial, you will discover that correlation is the […]

A Gentle Introduction to Calculating Normal Summary Statistics

Last Updated on August 8, 2019 A sample of data is a snapshot from a broader population of all possible observations that could be taken of a domain or generated by a process. Interestingly, many observations fit a common pattern or distribution called the normal distribution, or more formally, the Gaussian distribution. A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics and statistical methods that can be used with Gaussian data. In […]

A Gentle Introduction to the Law of Large Numbers in Machine Learning

Last Updated on August 8, 2019 We have an intuition that more observations are better. This is the same intuition behind the idea that if we collect more data, our sample of data will be more representative of the problem domain. There is a theorem in statistics and probability that supports this intuition that is a pillar of both of these fields and has important implications in applied machine learning. The name of this theorem is the law of large […]
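
The intuition can be checked with a small simulation, sketched here under the assumption that observations are drawn uniformly from [0, 1), whose true mean is 0.5. The helper name `sample_mean`, the seed, and the sample sizes are choices made for this example.

```python
import random

def sample_mean(n, seed=1):
    """Mean of n draws from a uniform distribution on [0, 1)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

# The true mean is 0.5; larger samples tend to land closer to it.
for n in (10, 1000, 100000):
    print(n, round(sample_mean(n), 4))

large_mean = sample_mean(100000)
```

Any single small sample can still be close to 0.5 by chance; the tendency only becomes reliable as the sample grows.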

A Gentle Introduction to the Central Limit Theorem for Machine Learning

Last Updated on January 14, 2020 The central limit theorem is an often quoted, but misunderstood pillar from statistics and machine learning. It is often confused with the law of large numbers. Although the theorem may seem esoteric to beginners, it has important implications about how and why we can make inferences about the skill of machine learning models, such as whether one model is statistically better than another and confidence intervals on model skill. In this tutorial, you will […]

Statistics Books for Machine Learning

Last Updated on August 14, 2020 Statistical methods are used at each step in an applied machine learning project. This means it is important to have a strong grasp of the fundamentals of the key findings from statistics and a working knowledge of relevant statistical methods. Unfortunately, statistics is not covered in many computer science and software engineering degree programs. Even if it is, it may be taught in a bottom-up, theory-first manner, making it unclear which parts are relevant […]

A Gentle Introduction to Nonparametric Statistics

Last Updated on November 10, 2019 A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known. Samples of data whose distribution we already know or can easily identify are called parametric data. In common usage, parametric often refers to data that was drawn from a Gaussian distribution. Data in which the distribution is unknown or cannot be easily identified is called nonparametric. In the case […]

A Gentle Introduction to Normality Tests in Python

Last Updated on August 8, 2019 An important decision point when working with a sample of data is whether to use parametric or nonparametric statistical methods. Parametric statistical methods assume that the data has a known and specific distribution, often a Gaussian distribution. If a data sample is not Gaussian, then the assumptions of parametric statistical tests are violated and nonparametric statistical methods must be used. There are a range of techniques that you can use to check if your […]

A Gentle Introduction to Statistical Hypothesis Testing

Last Updated on April 10, 2020 Data must be interpreted in order to add meaning. We can interpret data by assuming a specific structure for our outcome and using statistical methods to confirm or reject the assumption. The assumption is called a hypothesis and the statistical tests used for this purpose are called statistical hypothesis tests. Whenever we want to make claims about the distribution of data or whether one set of results is different from another set of results in […]

