How to Handle Missing Data with Python

Last Updated on August 28, 2020 Real-world data often has missing values. Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption. Handling missing data is important as many machine learning algorithms do not support data with missing values. In this tutorial, you will discover how to handle missing data for machine learning with Python. Specifically, after completing this tutorial you will know: How to mark invalid or corrupt values […]
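
As a rough illustration of the idea in this excerpt, the sketch below marks invalid readings as missing and then fills them in. It is not taken from the tutorial: the sample values, the choice of 0 as the invalid sentinel, and the helper names mark_missing and impute_mean are all assumptions made for this example, and mean imputation is only one of several possible strategies.

```python
from statistics import mean


def mark_missing(values, invalid=0):
    """Replace an assumed sentinel value (0 here) with None to mark it as missing."""
    return [None if v == invalid else v for v in values]


def impute_mean(values):
    """Fill missing (None) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical measurements where 0 is assumed to mean "not recorded".
readings = [70, 0, 66, 0, 65]
marked = mark_missing(readings)
imputed = impute_mean(marked)
print(marked)
print(imputed)
```

Marking first and imputing second keeps the two decisions separate, so the fill strategy can be swapped without changing how invalid values are detected.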

Why One-Hot Encode Data in Machine Learning?

Last Updated on June 30, 2020 Getting started in applied machine learning can be difficult, especially when working with real-world data. Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model. One good example is to use a one-hot encoding on categorical data. Why is a one-hot encoding required? Why can’t you fit a model on your data directly? In this post, you will discover the answer to […]
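
To make the term concrete, here is a minimal sketch of a one-hot encoding written in plain Python. The color labels and the function name one_hot_encode are hypothetical, chosen only for illustration; a real project would more likely use a library encoder.

```python
def one_hot_encode(labels):
    """Map each categorical label to a binary vector with a single 1.

    Categories are sorted alphabetically so the column order is deterministic.
    """
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
for label, row in zip(colors, encoded):
    print(label, row)
```

Each label becomes its own binary column, so the model is not handed an artificial ordering such as blue < green < red that an integer code would imply.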

How to Get the Most From Your Machine Learning Data

Last Updated on June 30, 2020 The data that you use, and how you use it, will likely define the success of your predictive modeling problem. Data and the framing of your problem may be the point of biggest leverage on your project. Choosing the wrong data or the wrong framing for your problem may lead to a model with poor performance or, at worst, a model that cannot converge. It is not possible to analytically calculate what data to […]

How to Remove Outliers for Machine Learning

Last Updated on August 18, 2020 When modeling, it is important to clean the data sample to ensure that the observations best represent the problem. Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers and often machine learning modeling and model skill in general can be improved by understanding and even removing these outlier values. In this tutorial, you will discover outliers and how […]
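
The excerpt does not say which detection method the tutorial uses. As one common possibility, the sketch below applies the interquartile range (IQR) rule with the conventional 1.5 multiplier. The sample data, the multiplier, and the function name remove_outliers_iqr are assumptions made for this example.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is a conventional default, assumed here rather than taken from the post.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 95]
kept = remove_outliers_iqr(sample)
print(kept)
```

The IQR rule does not assume the data is Gaussian, which makes it a reasonable first check before trying a standard-deviation cutoff.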

How to Save and Reuse Data Preparation Objects in Scikit-Learn

Last Updated on June 30, 2020 It is critical that any data preparation performed on a training dataset is also performed on a new dataset in the future. This may include a test dataset when evaluating a model or new data from the domain when using a model to make predictions. Typically, the model fit on the training dataset is saved for later use. The correct solution to preparing new data for the model in the future is to also […]

How to Perform Feature Selection with Categorical Data

Last Updated on August 18, 2020 Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable. Feature selection is often straightforward when working with real-valued data, such as by using Pearson’s correlation coefficient, but can be challenging when working with categorical data. The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared […]

How to Choose a Feature Selection Method For Machine Learning

Last Updated on August 20, 2020 Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship […]

How to Transform Target Variables for Regression in Python

Last Updated on August 18, 2020 Data preparation is a big part of applied machine learning. Correctly preparing your training data can mean the difference between mediocre and extraordinary results, even with very simple linear algorithms. Performing data preparation operations, such as scaling, is relatively straightforward for input variables and has been made routine in Python via the Pipeline scikit-learn class. On regression predictive modeling problems where a numerical value must be predicted, it can also be critical to scale […]

How to Use the ColumnTransformer for Data Preparation

Last Updated on August 18, 2020 You must prepare your raw data using data transforms prior to fitting a machine learning model. This is required to ensure that you best expose the structure of your predictive modeling problem to the learning algorithms. Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. It can be challenging when you have a dataset with mixed types and you want to selectively apply data […]

How to Perform Data Cleaning for Machine Learning with Python

Last Updated on June 30, 2020 Data cleaning is a critically important step in any machine learning project. In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that […]
