Data Preparation for Gradient Boosting with XGBoost in Python

Last Updated on August 27, 2020

XGBoost is a popular implementation of Gradient Boosting because of its speed and performance.

Internally, XGBoost models represent all problems as a regression predictive modeling problem that only takes numerical values as input. If your data is in a different form, it must be prepared into the expected format.

In this post, you will discover how to prepare your data for use with gradient boosting using the XGBoost library in Python.

After reading this post you will know:

How to encode string output variables for classification.
How to prepare categorical input variables using one hot encoding.
How to automatically handle missing data with XGBoost.
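
As a rough preview of these three steps, here is a minimal sketch using scikit-learn and NumPy. The toy data, the variable names, and the "?" marker for missing values are assumptions made up for illustration; they are not taken from the datasets used in this post. The sketch only prepares the arrays and does not fit a model.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: string class labels and one categorical input column.
labels = np.array(["setosa", "virginica", "setosa", "versicolor"])
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# 1. Encode string class labels as integers.
#    LabelEncoder assigns integers in sorted order of the class names.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# 2. One hot encode the categorical input column.
#    The encoder returns a sparse matrix by default, so convert it to dense.
onehot_encoder = OneHotEncoder()
X_onehot = onehot_encoder.fit_transform(colors).toarray()

# 3. Mark missing values as NaN so they can be passed through as missing.
#    The "?" marker is an assumption; use whatever marker your data uses.
raw_rows = [["1.5", "?"], ["?", "3.0"]]
X_numeric = np.array(
    [[np.nan if value == "?" else float(value) for value in row] for row in raw_rows]
)

print(y)
print(X_onehot)
print(X_numeric)
```

By default, XGBoost treats NaN as a missing value, so an array like X_numeric can usually be passed to the model without imputing first. Whether that beats imputation is something to check on your own data.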

Kick-start your project with my new book XGBoost With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Sept/2016: I updated a few small typos in the impute example.
Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
Update Jan/2017: Updated breast cancer example to convert input data to strings.
Update Oct/2019: Updated usage of OneHotEncoder to suppress warnings.
Update Dec/2019: Updated example to fix bug in API usage in the multi-class example.
Update May/2020: Updated to