How to Train to the Test Set in Machine Learning

Training to the test set is a type of overfitting in which a model is deliberately prepared to achieve good performance on a given test set at the expense of increased generalization error. It is common in machine learning competitions where a complete training dataset is provided but only the input portion of the test set is available. One approach to training to the test set involves constructing a training set that most resembles […]

How to Hill Climb the Test Set for Machine Learning

Last Updated on September 27, 2020

Hill climbing the test set is an approach to achieving good or perfect predictions on a machine learning competition without touching the training set or even developing a predictive model. As an approach to machine learning competitions, it is rightfully frowned upon, and most competition platforms impose important limitations to prevent it. Nevertheless, hill climbing the test set is something that a machine learning practitioner accidentally does as part of participating in […]
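To make the idea concrete, here is a minimal sketch in Python of how hill climbing against a scoring service could work. It is an illustration under stated assumptions, not the method from the post: the names `hill_climb_test_set`, `leaderboard` and `hidden_labels` are hypothetical, the task is assumed to be binary classification scored by accuracy, and the "leaderboard" is simulated locally with labels the search never reads directly.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hill_climb_test_set(score_fn, n_rows, n_iter=2000, seed=1):
    """Search for predictions using only the score returned by score_fn.

    No training data and no model are involved: the only feedback is the
    score for each submitted set of predictions (hypothetical setup).
    """
    rng = random.Random(seed)
    # Start from random binary predictions.
    solution = [rng.randint(0, 1) for _ in range(n_rows)]
    best = score_fn(solution)
    for _ in range(n_iter):
        if best == 1.0:
            break
        # Flip one randomly chosen prediction and resubmit.
        candidate = list(solution)
        i = rng.randrange(n_rows)
        candidate[i] = 1 - candidate[i]
        score = score_fn(candidate)
        # Keep the change only if the score did not get worse.
        if score >= best:
            solution, best = candidate, score
    return solution, best


# Simulated competition: the labels are hidden behind the scoring function.
label_rng = random.Random(7)
hidden_labels = [label_rng.randint(0, 1) for _ in range(50)]


def leaderboard(predictions):
    """Stand-in for a competition scoring endpoint (assumption)."""
    return accuracy(hidden_labels, predictions)


solution, score = hill_climb_test_set(leaderboard, len(hidden_labels))
print(f"final leaderboard accuracy: {score:.2f}")
```

The sketch shows why platforms limit the number of scored submissions: with unrestricted feedback, a search like this can fit the test labels without learning anything that generalizes.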

How to Prepare Data For Machine Learning

Last Updated on August 16, 2020

Machine learning algorithms learn from data. It is critical that you feed them the right data for the problem you want to solve. Even if you have good data, you need to make sure that it is in a useful scale and format, and that meaningful features are included. In this post you will learn how to prepare data for a machine learning algorithm. This is a big topic and you will cover the […]

How to Identify Outliers in your Data

Last Updated on August 16, 2020

Bojan Miletic asked a question about outlier detection in datasets when working with machine learning algorithms. This post is in answer to his question. If you have a question about machine learning, sign up for the newsletter and reply to an email, or use the contact form to ask. I will answer your question and may even turn it into a blog post. Kick-start your project with my new book Data Preparation for Machine Learning, […]

Data Cleaning: Turn Messy Data into Tidy Data

Last Updated on August 16, 2020

Data preparation is difficult because the process is not objective, or at least it does not feel that way. Questions like “what is the best form of the data to describe the problem?” are not objective. You have to think from the perspective of the problem you want to solve and try a few different representations through your pipeline. Hadley Wickham is an Adjunct Professor at Rice University and Chief Scientist at RStudio and […]

Rescaling Data for Machine Learning in Python with Scikit-Learn

Last Updated on June 30, 2020

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation. In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn. Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples. Let’s get started. Update: […]

Improve Model Accuracy with Data Pre-Processing

Last Updated on August 15, 2020

Data preparation can make or break the predictive ability of your model. In Chapter 3 of their book Applied Predictive Modeling, Kuhn and Johnson introduce the process of data preparation. They refer to it as the addition, deletion or transformation of training set data. In this post you will discover the data pre-processing steps that you can use to improve the predictive ability of your models. Kick-start your project with my new book Data […]

Discover Feature Engineering, How to Engineer Features and How to Get Good at It

Last Updated on August 15, 2020

Feature engineering is an informal topic, but one that is widely agreed to be key to success in applied machine learning. In creating this guide I went wide and deep and synthesized all of the material I could. You will discover what feature engineering is, what problem it solves, why it matters, how to engineer features, who is doing it well and where you can go to learn more and get good […]

An Introduction to Feature Selection

Last Updated on August 15, 2020

Which features should you use to create a predictive model? This is a difficult question that may require deep knowledge of the problem domain. It is possible to automatically select those features in your data that are most useful or most relevant for the problem you are working on. This is a process called feature selection. In this post you will discover feature selection, the types of methods that you can use and a […]

Data Leakage in Machine Learning

Last Updated on August 15, 2020

Data leakage is a big problem in machine learning when developing predictive models. Data leakage is when information from outside the training dataset is used to create the model. In this post you will discover the problem of data leakage in predictive modeling. After reading this post you will know: What data leakage is in predictive modeling. Signs of data leakage and why it is a problem. Tips and tricks that you can […]
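As a small illustration of the definition above, the sketch below shows one common form of leakage: computing preprocessing statistics on all of the data instead of on the training split alone. This is an assumed example, not taken from the post; the data values and the helper names `fit_standardizer` and `standardize` are hypothetical, and only the Python standard library is used.

```python
import statistics


def fit_standardizer(values):
    """Return the mean and population standard deviation of the values."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Rescale values with statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]


# Toy data (hypothetical): the first four rows are training, the last two are test.
data = [2.0, 4.0, 6.0, 8.0, 100.0, 120.0]
train, test = data[:4], data[4:]

# Leaky: the statistics are computed from train and test together, so
# information from outside the training dataset shapes the model inputs.
leaky_mean, leaky_std = fit_standardizer(data)

# Leakage-free: the statistics come from the training rows only and are
# then reused, unchanged, on the test rows.
train_mean, train_std = fit_standardizer(train)
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

print("leaky mean:", leaky_mean, "train-only mean:", train_mean)
```

The two means differ because the leaky version has already seen the test rows. Fitting any data preparation step on the training split only is one way to keep that information out of the model.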
