XGBoost is a machine learning library originally written in C++ and ported to R in the xgboost R package. Over the last several years, XGBoost’s effectiveness in Kaggle competitions has catapulted it to popularity. At Tychobra, XGBoost is our go-to machine learning library.
François Chollet and JJ Allaire summarize the value of XGBoost in the intro to “Deep Learning with R”:
In 2016 and 2017, Kaggle was dominated by two approaches: gradient boosting machines and deep learning. Specifically, gradient boosting is used for problems where structured data is available, whereas deep learning is used for perceptual problems such as image classification. Practitioners of the former almost always use the excellent XGBoost library.
These are the two techniques you should be the most familiar with in order to be successful in applied machine learning today: gradient boosting machines, for shallow-learning problems; and deep learning, for perceptual problems. In technical terms, this means you’ll need to be familiar with XGBoost and Keras—the two libraries that currently dominate Kaggle competitions.
At Tychobra, we have trained XGBoost models using the caret R package created by Max Kuhn. caret has treated us very well over the years (check out our post Machine Learning for Insurance Claims for an example of using xgboost with caret).
Max Kuhn and others at RStudio have more recently turned their attention from caret to “tidymodels” (the successor to caret). “tidymodels” is a collection of R packages that work together to simplify and supercharge model training and tuning. With the recent launch of tidymodels.org, we felt it was time to give the tidymodels R packages a shot.
In this post we will train and tune an XGBoost model using the tidymodels R packages. We use the AmesHousing dataset which contains housing data from Ames, Iowa. Our model will predict house sale price.
# data
library(AmesHousing)
# data cleaning
library(janitor)
# data prep
library(dplyr)
# tidymodels
library(rsample)
library(recipes)
library(parsnip)
library(tune)
library(dials)
library(workflows)
library(yardstick)
# plotting (used for the residual chart below)
library(ggplot2)
# speed up computation with parallel processing (optional)
library(doParallel)
all_cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)
Load in the Ames housing dataset.
# set the random seed so we can reproduce any simulated results.
set.seed(1234)
# load the housing data and clean names
ames_data <- make_ames() %>%
  janitor::clean_names()
At this point we would normally make a few simple plots and summaries to get a high-level understanding of the data. For simplicity, we are going to cut the EDA process from this post, but, in a real-world analysis, understanding the business problem and doing effective EDA are often the most time-consuming and crucial aspects of the analysis.
Now we split the data into training and test data. Training data is used for the model training and hyperparameter tuning. Once trained, the model can be evaluated against test data to assess accuracy.
# split into training and testing datasets; stratify by sale price
ames_split <- rsample::initial_split(
  ames_data,
  prop = 0.8,
  strata = sale_price
)
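As a quick sanity check (not part of the original post), rsample’s training() and testing() accessors confirm the roughly 80/20 split:
# row counts for the training and testing portions of the split
nrow(rsample::training(ames_split))
nrow(rsample::testing(ames_split))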
Preprocessing alters the data to make our model more predictive and the training process less compute intensive. Many models require careful and extensive variable preprocessing to produce accurate predictions. XGBoost, however, is robust against highly skewed and/or correlated data, so the amount of preprocessing required with XGBoost is minimal. Nevertheless, we can still benefit from some preprocessing.
In tidymodels, we use the recipes package to define these preprocessing steps, in what is called a “recipe”.
# preprocessing "recipe"
preprocessing_recipe <-
  recipes::recipe(sale_price ~ ., data = training(ames_split)) %>%
  # convert categorical variables to factors
  recipes::step_string2factor(all_nominal()) %>%
  # combine low frequency factor levels
  recipes::step_other(all_nominal(), threshold = 0.01) %>%
  # remove near-zero variance predictors, which provide little predictive information
  recipes::step_nzv(all_nominal()) %>%
  prep()
As you can see in the chart below, for the “neighborhood” variable, several of the factor levels with the fewest observations (less than 1% of the total number of observations) have been lumped into an “other” factor level. This lumping was done by step_other() in the above recipe.
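The original chart is not reproduced here, but a minimal sketch along these lines (assuming the ggplot2 package loaded above) produces a comparable bar chart of neighborhood counts after the recipe has been applied:
# count neighborhood levels in the baked training data (illustrative sketch)
preprocessing_recipe %>%
  bake(new_data = training(ames_split)) %>%
  count(neighborhood) %>%
  ggplot(aes(x = reorder(neighborhood, n), y = n)) +
  geom_col() +
  coord_flip() +
  xlab("Neighborhood (after step_other)") +
  ylab("Number of homes")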
We apply our previously defined preprocessing recipe with bake(). Then we use cross-validation to randomly split the training data into further training and test sets. We will use these additional cross validation folds to tune our hyperparameters in a later step.
ames_cv_folds <-
  recipes::bake(
    preprocessing_recipe,
    new_data = training(ames_split)
  ) %>%
  rsample::vfold_cv(v = 5)
We use the parsnip package to define the XGBoost model specification. Below we use boost_tree() along with tune() to define the hyperparameters to undergo tuning in a subsequent step.
# XGBoost model specification
xgboost_model <-
  parsnip::boost_tree(
    mode = "regression",
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %>%
  set_engine("xgboost", objective = "reg:squarederror")
We use the tidymodels dials package to specify the parameter set.
# grid specification
xgboost_params <-
  dials::parameters(
    min_n(),
    tree_depth(),
    learn_rate(),
    loss_reduction()
  )
Next we set up the grid space. The dials::grid_* functions support several methods for defining the grid space. We are using the dials::grid_max_entropy() function, which covers the hyperparameter space such that any portion of the space has an observed combination that is not too far from it.
xgboost_grid <-
  dials::grid_max_entropy(
    xgboost_params,
    size = 60
  )
knitr::kable(head(xgboost_grid))
min_n | tree_depth | learn_rate | loss_reduction |
---|---|---|---|
34 | 1 | 0.0118682 | 29.9649253 |
38 | 12 | 0.0001291 | 0.6156496 |
6 | 7 | 0.0000949 | 0.0000000 |
32 | 4 | 0.0000005 | 0.0000367 |
14 | 2 | 0.0001833 | 0.0000000 |
31 | 8 | 0.0000000 | 1.4345098 |
To tune our model, we perform grid search over our xgboost_grid’s grid space to identify the hyperparameter values that have the lowest prediction error.
We use the new tidymodels workflows package to add a formula to our XGBoost model specification.
xgboost_wf <-
  workflows::workflow() %>%
  add_model(xgboost_model) %>%
  add_formula(sale_price ~ .)
Tuning is where the tidymodels ecosystem of packages really comes together. Here is a quick breakdown of the objects passed to the first 4 arguments of our call to tune_grid() below:

- object: xgboost_wf, the workflow we defined with the parsnip model specification and the model formula
- resamples: ames_cv_folds, the cross validation folds we created with rsample
- grid: xgboost_grid, the grid of hyperparameter values we created with dials
- metrics: the yardstick metric set we want calculated on the hold-out folds (rmse, rsq, and mae)
# hyperparameter tuning
xgboost_tuned <- tune::tune_grid(
  object = xgboost_wf,
  resamples = ames_cv_folds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(rmse, rsq, mae),
  control = tune::control_grid(verbose = TRUE)
)
In the above code block, tune_grid() performed grid search over all 60 hyperparameter combinations defined in xgboost_grid and used 5-fold cross validation along with rmse (Root Mean Squared Error), rsq (R Squared), and mae (Mean Absolute Error) to measure prediction accuracy. So our tidymodels tuning just fit 60 x 5 = 300 XGBoost models, each with 1,000 trees, all in search of the optimal hyperparameters. Don’t try that on your TI-83!
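To inspect the full set of cross-validated results rather than just the best performers, tune::collect_metrics() returns one row per hyperparameter combination and metric (a quick peek, not shown in the original post):
# summarize performance across all combinations and metrics
xgboost_tuned %>%
  tune::collect_metrics() %>%
  head() %>%
  knitr::kable()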
These are the hyperparameter values which performed best at minimizing RMSE.
xgboost_tuned %>%
  tune::show_best(metric = "rmse") %>%
  knitr::kable()
min_n | tree_depth | learn_rate | loss_reduction | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|---|---|
12 | 7 | 0.0346875 | 0.0451186 | rmse | standard | 25561.99 | 5 | 2983.927 |
9 | 13 | 0.0183617 | 0.1042750 | rmse | standard | 25576.99 | 5 | 2687.849 |
23 | 5 | 0.0788798 | 0.7513677 | rmse | standard | 25645.38 | 5 | 2461.057 |
11 | 6 | 0.0091690 | 0.0000001 | rmse | standard | 25669.84 | 5 | 2706.457 |
10 | 2 | 0.0108475 | 0.0000003 | rmse | standard | 25883.23 | 5 | 2828.172 |
Next, isolate the best performing hyperparameter values.
xgboost_best_params <- xgboost_tuned %>%
  tune::select_best(metric = "rmse")
knitr::kable(xgboost_best_params)
min_n | tree_depth | learn_rate | loss_reduction |
---|---|---|---|
12 | 7 | 0.0346875 | 0.0451186 |
Finalize the XGBoost model to use the best tuning parameters.
xgboost_model_final <- xgboost_model %>%
  finalize_model(xgboost_best_params)
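Printing the finalized specification (a quick check, not shown in the original post) confirms that the tune() placeholders have been replaced with the selected values:
# the tuned hyperparameter values now appear directly in the model spec
xgboost_model_final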
Now that we have tuned our model, we need to fit it on the full training set and evaluate its performance. We use the test data from the initial train/test split (data that was not used in model training) to evaluate performance.
We use the rmse (Root Mean Squared Error), rsq (R Squared), and mae (Mean Absolute Error) metrics from the yardstick package in our model evaluation.
First let’s evaluate the metrics on the training data:
train_processed <- bake(preprocessing_recipe, new_data = training(ames_split))
train_prediction <- xgboost_model_final %>%
  # fit the model on all the training data
  fit(
    formula = sale_price ~ .,
    data = train_processed
  ) %>%
  # predict the sale prices for the training data
  predict(new_data = train_processed) %>%
  bind_cols(training(ames_split))
xgboost_score_train <-
  train_prediction %>%
  yardstick::metrics(sale_price, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))
knitr::kable(xgboost_score_train)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 3,807.24 |
rsq | standard | 1.00 |
mae | standard | 2,747.17 |
And now for the test data:
test_processed <- bake(preprocessing_recipe, new_data = testing(ames_split))
test_prediction <- xgboost_model_final %>%
  # fit the model on all the training data
  fit(
    formula = sale_price ~ .,
    data = train_processed
  ) %>%
  # use the training model fit to predict the test data
  predict(new_data = test_processed) %>%
  bind_cols(testing(ames_split))
# measure the accuracy of our model using `yardstick`
xgboost_score <-
  test_prediction %>%
  yardstick::metrics(sale_price, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))
knitr::kable(xgboost_score)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 30,217.58 |
rsq | standard | 0.87 |
mae | standard | 15,728.22 |
The above metrics on the test data are significantly worse than our training data metrics, so we know that there is some overfitting going on in our model. This highlights the importance of using test data, rather than training data, to evaluate model performance.
To quickly check that there is not an obvious issue with our model’s predictions, let’s plot the test data residuals.
house_prediction_residual <- test_prediction %>%
  arrange(.pred) %>%
  mutate(residual_pct = (sale_price - .pred) / .pred) %>%
  select(.pred, residual_pct)
ggplot(house_prediction_residual, aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted Sale Price") +
  ylab("Residual (%)") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent)
The above chart does not show any super obvious trends in the residuals. This indicates that, at a very high level, our model is not systematically making inaccurate predictions for houses with certain predicted sale prices. We would do more model validation here for a real-world analysis, but, for the sake of this post, the above chart is good enough for us.
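As one example of the additional validation we might do (a sketch, not part of the original analysis), a predicted versus actual plot on the test data gives another quick visual check:
# predicted vs. actual sale price on the test data (illustrative sketch)
ggplot(test_prediction, aes(x = sale_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  xlab("Actual Sale Price") +
  ylab("Predicted Sale Price") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format())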
In this post, we were not overly concerned with our model’s performance. Our goal was to simply work through the process of training an XGBoost model using tidymodels, and to learn the tidymodels basics along the way.
Tidymodels gives us a standard process and vocabulary to handle resampling (rsample), data preprocessing (recipes), model specification (parsnip), tuning (tune), and model validation (yardstick). The work done by the tidymodels team to “tidy” the machine learning process is a step change improvement for approachability to machine learning in R; it is easier than ever to train and (more importantly) understand the model training process using the tidymodels packages. Thank you tidymodels team!
We are still just getting started with tidymodels, so please let me know if you see errors or have suggestions for improvements!
If you liked this post, you should check out this excellent post by Business Science which goes into a lot more detail in the EDA process and also trains a glmnet model.