Chapter 10 Tree-Based Models — Bootstrapping Explained

This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.


Tree-based models are a type of machine learning technique that uses tree-like structures to make predictions. The most basic type of tree-based model is a Decision Tree. A Decision Tree guides an observation through a tree-like structure with many branches. The location where a specific observation ends up determines the prediction (more about Decision Trees in Section 10.3).

Other tree-based models are based on a combination of Decision Trees. Because they combine an ensemble of machine learning models (i.e., Decision Trees), they fall in the category of ensemble models. Tree-based models can be used for classification and regression tasks.

Since tree-based models are built based on Decision Trees, you will learn about Decision Trees in Section 10.3. Section 10.3.1 introduces the idea behind Decision Trees, and in the interactive Section 10.3.2, you can create your own Decision Tree model for a classification task. In that section, you will use a Decision Tree model to predict the survival of passengers on the Titanic.

How an ensemble of Decision Trees can be combined into a Random Forest will be covered in Section 10.4. While the idea behind a Random Forest model is introduced in Section 10.4.1, we will show a real-world application in Section 10.4.2. There, we will use a Random Forest model together with tidymodels to estimate the COVID-19 vaccination rate for each county in the continental U.S.

In Section 10.5, we will introduce several Boosting Trees algorithms, which are also ensemble models because they combine multiple Decision Trees. Boosting Trees are more advanced and more recent algorithms based on Decision Trees. We will briefly introduce the ideas behind some Boosting Trees algorithms such as AdaBoost, LightGBM, and CatBoost in Section 10.5. Later, in Section 10.5.1, the idea behind another Boosting Trees algorithm called Gradient Boosting will be explained in more detail, and in the interactive Section 10.5.2 you will use XGBoost, a computationally advanced version of Gradient Boosting, for a real-world application. You will again estimate the COVID-19 vaccination rate for the continental U.S. counties.

10.1 Learning Outcomes

This section outlines what you can expect to learn in this chapter. In addition, the corresponding section number is included for each learning outcome to help you to navigate the content, especially when you return to the chapter for review.

In this chapter, you will learn:

How you can guide an observation through a Decision Tree (see Section 10.3.1).
How to use decision rules when observations are guided through a Decision Tree (see Section 10.3.1).
How you can use a training dataset to train a Decision Tree (see Section 10.3.1).
How to use a trained Decision Tree to make predictions for new observations (see Section 10.3.1).
How to interpret the structure of a trained Decision Tree to gain insight into the underlying causality implications (see Section 10.3.1).
How sensitively the structure of a Decision Tree can react to small changes in the data or the hyper-parameters (see Section 10.3.2).
How several Decision Trees can be combined into an ensemble model like a Random Forest (see Section 10.4.1).
How you can use the Subspace Method and Bagging to create slightly different Decision Trees for a Random Forest (see Section 10.4.1).
How you can use Bootstrapping to create multiple Bootstrap samples and Out of Bag samples from one training dataset (see Section 10.4.1).
How to use a Random Forest model to predict vaccination behavior for the continental U.S. counties (see Section 10.4.2).
How you can use Gradient Boosting to combine Decision Trees into an ensemble where each Decision Tree adjusts for the errors of the previous Decision Tree (see Section 10.5).
How you use an interactive project to predict vaccination behavior with the XGBoost algorithm for the continental U.S. counties (see Section 10.5.2).

10.2 R Packages Required for the Chapter

This section lists the R packages that you need when you load and execute code in the interactive sections in RStudio. Please install the following packages using Tools -> Install Packages \(\dots\) from the RStudio menu bar (you can find more information about installing and loading packages in Section 3.4):

The rio package (Chan et al. (2021)) to enable the loading of various data formats with one import() command. Files can be loaded from the user’s hard drive or the Internet.
The janitor package (Firke (2023)) to rename variable names to UpperCamel and to substitute spaces and special characters in variable names.
The tidymodels package (Kuhn and Wickham (2020)) to streamline data engineering and machine learning tasks.
The kableExtra (Zhu (2021)) package to support the rendering of tables.
The learnr package (Aden-Buie, Schloerke, and Allaire (2022)), which is needed together with the shiny package (Chang et al. (2022)) for the interactive exercises in this book.
The shiny package (Chang et al. (2022)), which is needed together with the learnr package (Aden-Buie, Schloerke, and Allaire (2022)) for the interactive exercises in this book.
The rpart package (Therneau and Atkinson (2022)) to create Decision Trees.
The rpart.plot package (Milborrow (2022)) to plot Decision Trees.
The ranger package (Wright and Ziegler (2017)) to create a Random Forest.
The xgboost package (Chen et al. (2023)) to perform Extreme Gradient Boosting (XGBoost).
The parallel package (R Core Team (2022)) for parallel processing the Random Forest algorithm.
The doParallel package (Corporation and Weston (2022)) for parallel processing the XGBoost algorithm.

10.3 Decision Trees

As mentioned above, a Decision Tree machine learning model can be used for classification and regression tasks.

In the following Section 10.3.1, the idea behind Decision Trees will be introduced in an intuitive way using a classification task.

10.3.1 The Idea Behind a Decision Tree

The best way to introduce the idea behind Decision Trees is to see the algorithm in action. We will use the Titanic dataset⁷⁷ in R to generate a Decision Tree and discuss the resulting graph.

In the code block below, we download the data, rename the variables into UpperCamel form, and select() the variables \(Survived\), \(Sex\), \(Class\) (renamed from \(Pclass\)), \(Age\), and \(Fare\) (renamed from \(FareInPounds\)) for the analysis. For better readability, we use mutate() to convert \(Survived\) to data type logical (\(Survived=TRUE\) for a passenger that survived and \(Survived=FALSE\) otherwise). Since we perform a classification task, the outcome variable \(Survived\) also has to be converted into a factor variable:

library(rio); library(tidymodels); library(janitor)
DataTitanic=import("https://ai.lange-analytics.com/data/Titanic.csv") |> 
            clean_names("upper_camel") |>
            select(Survived, Sex, Class=Pclass, Age, 
                   Fare=FareInPounds) |>
             mutate(Survived=as.logical(Survived)) |>  
             mutate(Survived=as.factor(Survived))

Afterward, as usual, we create a training and a testing dataset:

set.seed(777)
Split7525=initial_split(DataTitanic, strata=Survived)
DataTrain=training(Split7525)
DataTest=testing(Split7525)
head(DataTrain)

##   Survived    Sex Class Age   Fare
## 1    FALSE   male     3  35  8.050
## 2    FALSE   male     1  54 51.862
## 3    FALSE   male     3   2 21.075
## 4    FALSE   male     3  20  8.050
## 5    FALSE   male     3  39 31.275
## 6    FALSE female     3  14  7.854

In the code block below, we create a recipe and a model design:

RecipeTitanic=recipe(Survived~., data=DataTrain)

ModelDesignDecTree=decision_tree(tree_depth=3) |> 
                   set_engine("rpart") |> 
                   set_mode("classification")

The recipe stored in RecipeTitanic does not scale the data because this is not required for Decision Trees and other tree-based machine learning models.

For the model design, we use the command decision_tree() from the tidymodels package with the rpart package as the engine (set_engine("rpart")), and set the analysis mode to classification (set_mode("classification")). The argument tree_depth=3 limits our Decision Tree to three levels of decision rules (see Figure 10.1).

In a real-world application the tree_depth might be greater than three, but it should not be too high as this can cause overfitting.

To create a fitted workflow model, we add the model design and the recipe to a workflow and then use fit(DataTrain) to fit the model to the training data:

WfModelTitanic=workflow() |> 
               add_model(ModelDesignDecTree) |> 
               add_recipe(RecipeTitanic) |> 
               fit(DataTrain)

The resulting R object WfModelTitanic can be used for predictions and to generate a graphical representation of the fitted Decision Tree:

library(rpart.plot)
rpart.plot(extract_fit_engine(WfModelTitanic), 
           yes.text="YES", no.text="NO",roundint=FALSE)

The command extract_fit_engine(WfModelTitanic) extracts the model from the workflow WfModelTitanic and the rpart.plot() command plots the graphical representation of the fitted Decision Tree (see Figure 10.1).


FIGURE 10.1: Decision Tree for Titanic Survival

We will later explain how the Decision Tree in Figure 10.1 was generated. For now, let us focus on how the tree can be used for predicting survival on the Titanic.

A Decision Tree like the one in Figure 10.1 consists of hierarchically organized nodes (the blue and green rectangles). It can be compared to an ancestry tree. The root node on top of the tree has no ancestors but is a parent to two children. Each of these children is a parent to two other children. This process continues until it stops at the bottom of the tree. The terminal nodes at the bottom have parents but no children.

In contrast to an ancestry tree, a Decision Tree is not used to display any ancestry. Instead, it guides observations from a dataset starting at the root node to one of the terminal nodes. Decision rules are used after each parent node to determine to which of the following two child nodes an observation is moved on its way towards the terminal nodes.

Decision Rules

A decision rule determines if an observation is moved to the left child or to the right child of a parent node. A decision rule states a condition such as Sex=male (see the first decision rule between the root node and its two children in Figure 10.1). The condition is either fulfilled (YES) or not fulfilled (NO):

If the condition is fulfilled, the observation moves to the left child node. Otherwise, it moves to the right child node.

Now you know how an observation descends down from the root node to one of the terminal nodes. But how are the three labels inside each node interpreted, and how are they determined?

Let us begin at the root node and in our mind process the complete training dataset with all of its 664 observations at once through the Decision Tree:

The root node contains the complete training dataset, which is indicated by the label “100%” in the third row of the root node. The second label indicates that, sadly, the survival rate was only 0.39 (39%). Since the survival rate is less than \(0.5\), we would predict that a random passenger (observation) does not survive (\(\widehat{Survived}=FALSE\); see the first label in the root node).
Next, all observations are moved from the root node to one of its two child nodes depending on whether the decision rule Sex=male is fulfilled (left child) or not (right child).

Consequently, male observations (passengers) are moved to the left child node and female observations to the right child node. When reading the three labels from the left child node, we can see that 64% of the training data observations ended up in this node (64% of passengers in the training dataset are male). The survival rate among these passengers was only \(0.2\) (see second label), and therefore, we would predict that an individual male passenger does not survive (\(\widehat{Survived}=FALSE\); \(0.2<0.5\)).

The right child node contains only female observations. From the training data observations, 36% are female. For these female passengers, the survival rate was \(0.73\). Thus, for an individual female passenger, we would predict that she survives (\(\widehat{Survived}=TRUE\); \(0.73>0.5\)).

In summary: After the first split by \(Sex\), we can conclude that men’s survival chances were 20%, while the survival chances of women were 73%.
The process does not stop after the split for \(Sex\). When you look at the left child node below the root node, you can see that the male passengers are further split into older males (13 years and older; moved to the left child node) and younger males (moved to the right child node).

Checking the labels in the left child node (a terminal node), you can see that 60% of the training observations were male and older than \(13\) years. The survival rate among those was only \(0.17\). Therefore, we would predict for an older male that he does not survive (\(\widehat{Survived}=FALSE\); \(0.17<0.5\)).

Checking the labels of the right child node (not a terminal node) below the \(Age>=13\) decision rule shows that 4% of the training observations were young males, and their survival rate was \(0.57\). Thus, for an individual young male (if we do not know in which class he traveled), we would predict that he survives (\(\widehat{Survived}=TRUE\)) because \((0.57>0.5)\).
After all observations from the training dataset descend through the Decision Tree, they will end up in one of the six terminal nodes. Note that, apart from rounding, the percentages of training data in the six terminal nodes add up to \(100\%\) \((60\%+3\%+1\%+3\%+14\%+18\% \approx 100\%)\).

We know the survival rate for each terminal node based on the training observations that ended up in that node. Therefore, we can predict \((\widehat{Survived}=FALSE\) or \(\widehat{Survived}=TRUE)\) for each terminal node (see the first labels of the six terminal nodes).

Decision Tree Predictions for Binary vs. Continuous Variables

The prediction for a specific node for a binary variable like \(Survived\) (TRUE or FALSE) can be derived as follows:

Using all observations from the training dataset that end up in a specific node, find the proportion for the label TRUE. If this proportion is greater than 50%, predict TRUE for any observation that also ends up in this node. Otherwise, predict FALSE.

The prediction for a specific node for a continuous variable can be derived as follows:

Calculate the mean for the outcome variable from all training data observations that ended up in a specific node. This is the prediction for an observation’s outcome that later also ends up in that node.
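As a minimal sketch of these two rules, the base R code below uses made-up outcome values for the training observations in a single node. The vectors NodeSurvived and NodeOutcome are hypothetical and not taken from the Titanic data:

```r
# Hypothetical outcomes of five training observations that ended up
# in the same node (binary outcome variable)
NodeSurvived=c(TRUE, FALSE, FALSE, TRUE, TRUE)
PropTrue=mean(NodeSurvived)      # proportion of TRUE in the node
PredClass=PropTrue>0.5           # predict TRUE only if proportion > 50%

# Hypothetical outcomes of three training observations that ended up
# in the same node (continuous outcome variable)
NodeOutcome=c(10, 20, 30)
PredValue=mean(NodeOutcome)      # the node mean is the prediction

print(PropTrue)
print(PredClass)
print(PredValue)
```

Any new observation that later ends up in the same node would receive the node's prediction, regardless of its other predictor values.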

Now that you know how the training data were guided through the tree to one of the terminal nodes, you should be able to interpret the Decision Tree as a whole. Try to confirm the following five statements with the Decision Tree in Figure 10.1:

Adult male passengers 13 years or older, regardless of the class they traveled in and the fare they paid, had only a survival rate of \(0.17\).
Young male passengers (younger than 13 years), regardless of which class they traveled and the fare they paid, had a survival rate of \(0.57\).
Young male passengers (younger than 13 years) traveling in Third Class had only a survival rate of \(0.4\) regardless of the fare they paid.
Female passengers, regardless of age and not considering the class they traveled in or the fare they paid, had a survival rate of 0.73.
When considering the class female passengers traveled in, we can see female passengers, regardless of age, had a survival rate of 0.95 when they traveled in First or Second Class regardless of the fare they paid.

Note that not all decision rules and nodes from the Decision Tree in Figure 10.1 make sense — a major weakness of Decision Trees.

For example, if you look at the decision rule that created the fourth and fifth terminal nodes in Figure 10.1 and the related parent node, you can see that females traveling in Third Class had a lower survival rate when the fare they paid was more than 23 British pounds as compared to otherwise (5% vs. 60%). It makes little sense that a higher ticket price in the same class would lower somebody’s survival chances.

We will discuss more problems related to the interpretability of Decision Trees in Section 10.3.2. For now, let us focus on predicting the observations from the testing dataset.

We start by taking a random observation from the testing dataset, a nine-year-old boy traveling in Third Class, and see how to predict his survival:

TABLE 10.1: 9-Year-Old Boy on the Titanic
Survived	Sex	Class	Age	Fare
TRUE	male	3	9	15.9

From the testing data observation, we know already that the boy did survive \((Survived=TRUE)\). However, we are interested in the Decision Tree’s prediction \((\widehat{Survived})\).

Given that the passenger is male, younger than \(13\) years, and that he travels in Third Class, we can follow the observation through the Decision Tree. Starting at the root node, it moves to the left (male), then to the right (younger than 13 years), and then to the left (traveling in Third Class). The observation ends up in the second terminal node from the left and is predicted not to survive \((\widehat{Survived}=FALSE)\).

However, if you look at the observation above, you can see that the passenger actually did survive. So, we count this testing observation as a False Negative.⁷⁸
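To make the path through the tree explicit, the decision rules described for Figure 10.1 can be written as nested conditions. This is only a sketch: the function name PredictSurvivalByHand is hypothetical, and the fare threshold of 23 is the rounded value mentioned in the text, not necessarily the exact splitting value that rpart found:

```r
# Decision rules as described in the text for Figure 10.1.
# Assumption: the fare threshold (23) is rounded.
PredictSurvivalByHand=function(Sex, Class, Age, Fare) {
  if (Sex=="male") {
    if (Age>=13) return(FALSE)   # older males (survival rate 0.17)
    if (Class==3) return(FALSE)  # young males, Third Class (0.4)
    return(TRUE)                 # young males, First/Second Class
  }
  if (Class!=3) return(TRUE)     # females, First/Second Class (0.95)
  if (Fare>23) return(FALSE)     # females, Third Class, higher fare
  TRUE                           # females, Third Class, lower fare
}

# The nine-year-old boy from Table 10.1
PredBoy=PredictSurvivalByHand(Sex="male", Class=3, Age=9, Fare=15.9)
print(PredBoy)
```

The function returns FALSE for the boy, which matches the prediction derived above by following the tree manually.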

If we guide all testing observations through the Decision Tree, we can generate predictions for all testing observations and augment the predictions to the testing data frame DataTest:

DataTestWithPred=augment(WfModelTitanic, new_data=DataTest)
head(DataTestWithPred)

## # A tibble: 6 × 8
##   .pred_class .pred_FALSE .pred_TRUE Survived Sex   
##   <fct>             <dbl>      <dbl> <fct>    <chr> 
## 1 FALSE            0.83        0.17  FALSE    male  
## 2 TRUE             0.0492      0.951 TRUE     female
## 3 FALSE            0.83        0.17  FALSE    male  
## 4 TRUE             0.402       0.598 TRUE     female
## 5 TRUE             0.402       0.598 FALSE    female
## 6 FALSE            0.83        0.17  FALSE    male  
## # ℹ 3 more variables: Class <int>, Age <dbl>,
## #   Fare <dbl>

Since we know for each passenger (observation of the testing dataset) if they survived or not, and also what the Decision Tree predicted (see the .pred_class column), we can generate a confusion matrix and also use the metric_set() command to calculate the metrics accuracy, sensitivity, and specificity for the testing data:

conf_mat(DataTestWithPred, truth=Survived, estimate=.pred_class)

##           Truth
## Prediction TRUE FALSE
##      TRUE    63    14
##      FALSE   23   123

metricSetTitanic=metric_set(accuracy, sensitivity, specificity)
metricSetTitanic(DataTestWithPred, truth=Survived, estimate=.pred_class)

## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.834
## 2 sensitivity binary         0.733
## 3 specificity binary         0.898

With an overall accuracy of 83%, the prediction results are not bad. The model worked particularly well in predicting passengers that did not survive (negative class; specificity). As you can see, specificity is higher than sensitivity. However, the difference is not big enough to consider the dataset as unbalanced.
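If you would like to verify the three metrics, they can be recomputed from the four cells of the confusion matrix above. The short base R sketch below treats TRUE (survived) as the positive class; the variable names TP, FP, FN, and TN are only used for this illustration:

```r
# Cells of the confusion matrix (positive class: TRUE)
TP=63    # predicted TRUE,  actually TRUE
FP=14    # predicted TRUE,  actually FALSE
FN=23    # predicted FALSE, actually TRUE
TN=123   # predicted FALSE, actually FALSE

Accuracy=(TP+TN)/(TP+FP+FN+TN)   # share of correct predictions
Sensitivity=TP/(TP+FN)           # share of survivors predicted correctly
Specificity=TN/(TN+FP)           # share of non-survivors predicted correctly

print(round(c(Accuracy=Accuracy, Sensitivity=Sensitivity,
              Specificity=Specificity), 3))
```

The rounded results (0.834, 0.733, and 0.898) agree with the output from metric_set().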

This leaves us with one topic that we have not touched on so far:

How does the Optimizer determine the decision rules?

Decision rules are determined from the top of the Decision Tree down to the bottom. The Optimizer starts with finding the best decision rule for the root node, then moves down to the child nodes and finds the decision rule for each of the child nodes and then for the child nodes of the child nodes.⁷⁹ The process stops when either the maximum level set with the hyper-parameter tree_depth is reached (tree_depth=3 in our case) or when splitting a node further no longer increases the predictive quality.

Decision rules consist of two components:

The splitting variable. That is, the variable used to split the observations from the parent node into the two child nodes. For example, \(Sex\), \(Age\), or \(Fare\).
The splitting value. The splitting value is relevant for continuous variables. For example, when the splitting variable is \(Age\), we have to decide on the passenger’s age that determines if an observation is moved to the left or the right child node (13 years or older for \(YES\) in our case; see Figure 10.1).

To find the best splitting variable with the best splitting value, the Optimizer compares all combinations of splitting variables and splitting values.⁸⁰

What criterion determines if a splitting variable/value combination is good?

The Decision Tree algorithm from the rpart package uses Gini Impurity as the default criterion for classification problems. Other measures used for Decision Trees include Information Gain and Chi-Square.⁸¹ Here we focus on Gini Impurity.⁸²

The Titanic passengers were different in many ways. Some of their differences (e.g., their \(Sex\)) were crucial for survival, while others were less important. When we split a parent node (for instance, the root node) into two child nodes, we want the two child nodes to be as different as possible regarding survival proportions. Ideally, we would like to get one child node with a survival proportion of \(100\%\) and the other with \(0\%\). This would give us very good predictive quality. In that case, the two nodes would be perfectly pure, one containing a pure group of survivors and the other a pure group of non-survivors. Gini Impurity would be consequently at the lowest level for both child nodes.

Let us follow this admittedly unrealistic path by assuming that all men on the Titanic died and all women survived. A decision rule of Sex=male would create two pure child nodes with one containing only non-survivors and the other containing only survivors. Gini Impurity should be \(0\) for both pure child nodes.

Gini Impurity for an individual child node is defined as two times the product of the proportions of the node's observations in the two classes (\(P_{Surv.}\) and \(P_{NonSurv.}\)). In our case:⁸³

\[\begin{equation} G^{Imp}= 2 P_{Surv.} P_{NonSurv.} \tag{10.1} \end{equation}\]

In our example that creates two pure child nodes, Equation (10.1) shows the Gini Impurity for the female child node would indeed be \(0\) (\(P_{Surv.}=1\) and \(P_{NonSurv.}=0\)) and the same would be true for the male child node with only non-survivors (\(P_{Surv.}=0\) and \(P_{NonSurv.}=1\)).

Let us use another extreme example for a decision rule on the top level of the Decision Tree in Figure 10.1 and see how Gini Impurity changes from the root node to the two child nodes. We use Blood Type = O instead of Sex = male. Obviously, this decision rule makes little sense, since blood type will not influence the survival chance on the Titanic. The proportions of survivors in both child nodes would be approximately the same as in the parent node (the root node): \(P_{Surv.}=0.39\) and \(P_{NonSurv.}=0.61\) (see Figure 10.1).

Equation (10.1) shows that the resulting Gini Impurities would be the same for the parent node and the two child nodes:

\[ G^{Imp}_{Parent}=G^{Imp}_{LeftChild}=G^{Imp}_{RightChild}=2\cdot0.39\cdot 0.61=0.48 \]

The value of the Gini Impurity is not so crucial here. What is crucial is that the decision rule is unable to lower the Gini Impurity from the parent node to the child nodes. Gini Impurity was \(0.48\) in the parent node and the (weighted) average from the two child nodes is still \(0.48\). The decision rule is useless!

Finally, let us leave the extreme examples and take a look at the decision rule Sex = male from the Decision Tree in Figure 10.1. By how much could this decision rule lower Gini Impurity? We know already from the last example that the root node has a Gini Impurity of \(G^{Imp}_{Parent}=0.48\). According to Equation (10.1), the Gini Impurity for the left child node is (see Figure 10.1):

\[ G^{Imp}_{male}= 2\cdot0.2\cdot 0.8=0.32 \]

The Gini Impurity for the right child node is:

\[ G^{Imp}_{female}= 2\cdot0.73\cdot 0.27=0.39 \]

The Gini Impurity for a decision rule is the average of the two child nodes weighted by the proportion of the observations that ended up in the two child nodes:

\[ G^{Imp}_{male/female}=0.64 \cdot 0.32 + 0.36 \cdot 0.39= 0.35 \]

Consequently, the decision rule Sex=male decreased the Gini Impurity by \(0.13\) from \(G^{Imp}_{Parent}=0.48\) to \(G^{Imp}_{male/female}=0.35\).
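The arithmetic above can be reproduced with a few lines of base R. The sketch below implements Equation (10.1) as a small helper function (the name GiniImpurity is our choice for this illustration) and uses the rounded proportions read from Figure 10.1:

```r
# Equation (10.1): Gini Impurity for a node with two classes
GiniImpurity=function(PSurv) 2*PSurv*(1-PSurv)

GiniParent=GiniImpurity(0.39)   # root node
GiniMale=GiniImpurity(0.20)     # left child node (Sex=male: YES)
GiniFemale=GiniImpurity(0.73)   # right child node (Sex=male: NO)

# Weighted average of the child nodes (64% male, 36% female)
GiniSplit=0.64*GiniMale+0.36*GiniFemale

print(round(c(Parent=GiniParent, Male=GiniMale, Female=GiniFemale,
              Split=GiniSplit, Decrease=GiniParent-GiniSplit), 2))
```

Rounded to two digits, the results are 0.48, 0.32, 0.39, 0.35, and 0.13. Because the inputs are already rounded, a calculation based on the raw training data might differ slightly in the third decimal.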

We know that the Optimizer calculated the Gini Impurity decrease for all possible splitting variable/value combinations. Therefore, a Gini Impurity decrease of \(0.13\) must have been the largest. Otherwise, Sex=male would not have been chosen as the decision rule.
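To illustrate how such a search could work for a single continuous splitting variable, the sketch below tries every candidate splitting value and keeps the one with the lowest weighted Gini Impurity. It is a simplified illustration, not the actual rpart implementation: the toy data and the helper functions GiniFromOutcome() and WeightedGini() are hypothetical, and using midpoints between neighboring values as candidates is an assumption made for this example:

```r
# Gini Impurity calculated from a logical outcome vector
GiniFromOutcome=function(Outcome) {
  P=mean(Outcome)
  2*P*(1-P)
}

# Weighted Gini Impurity of the two child nodes for the rule Var>=Value
WeightedGini=function(Outcome, Var, Value) {
  Left=Outcome[Var>=Value]    # rule fulfilled (YES)
  Right=Outcome[Var<Value]    # rule not fulfilled (NO)
  (length(Left)*GiniFromOutcome(Left)+
   length(Right)*GiniFromOutcome(Right))/length(Outcome)
}

# Hypothetical toy data (not the Titanic data)
Age=c(2, 5, 9, 20, 35, 40, 54, 60)
Survived=c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Candidate splitting values: midpoints between neighboring ages
SortedAge=sort(unique(Age))
Candidates=(head(SortedAge, -1)+tail(SortedAge, -1))/2

# Evaluate all candidates and pick the one with the lowest impurity
Ginis=sapply(Candidates, function(V) WeightedGini(Survived, Age, V))
BestValue=Candidates[which.min(Ginis)]
print(BestValue)
```

For the toy data, the best splitting value is 14.5 with a weighted Gini Impurity of 0.2. Repeating this search for every predictor variable and choosing the overall best combination gives the decision rule for a node.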

10.3.2 The Instability of Decision Trees

You saw already one drawback of Decision Trees in the previous section — not all decision rules make sense.

This section will show another drawback of Decision Trees. They react very sensitively to small changes. For example, in the code block below, the Titanic dataset is split into testing and training data:

set.seed(888)
Split7525=initial_split(DataTitanic, strata=Survived)
DataTrain=training(Split7525)
DataTest=testing(Split7525)

FIGURE 10.2: Decision Tree for Titanic Survival with set.seed(888)

The code is identical to the code from the previous section, except that the set.seed() command is changed from set.seed(777) to set.seed(888), resulting in a change in the composition of the testing and training data.

Afterward, we processed the slightly changed training data with the same R code as in the previous section and created a Decision Tree. When assessing the new Decision Tree based on the testing data by calculating a confusion matrix and the related metrics, you can see below that the results did not change very much.

##           Truth
## Prediction TRUE FALSE
##      TRUE    64    14
##      FALSE   22   123

## # A tibble: 3 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.839
## 2 sensitivity binary         0.744
## 3 specificity binary         0.898

However, when you compare the structure of the Decision Tree from Figure 10.1 from the previous section to the structure of the Decision Tree in Figure 10.2, which was created using set.seed(888) instead of set.seed(777), you can see that different decision rules are used. A small change in the data led to a major change in the structure of the resulting Decision Tree, making interpretation questionable.

10.3.3 🧭Project: Test the Instability of Decision Trees

In the following interactive project, you can research if what you observed in this section is an exception or if Decision Trees are generally reacting sensitively to small changes.

Interactive Section

In this section, you will find content together with R code to execute, change, and rerun in RStudio.

The best way to read and to work with this section is to open it with RStudio. Then you can interactively work on R code exercises and R projects within a web browser. This way you can apply what you have learned so far and extend your knowledge. You can also choose to continue reading either in the book or online, but you will not benefit from the interactive learning experience.

To work with this section in RStudio in an interactive environment, follow these steps:

Ensure that both the learnr and the shiny package are installed. If not, install them from RStudio’s main menu (Tools -> Install Packages \(\dots\)).
Download the Rmd file for the interactive session and save it in your project folder. You will find the link for the download below.
Open the downloaded file in RStudio and click the Run Document button, located in the editing window’s top-middle area.

For detailed help for running the exercises including videos for Windows and Mac users we refer to: https://blog.lange-analytics.com/2024/01/interactsessions.html

Do not skip this interactive section because besides providing applications of already covered concepts, it will also extend what you have learned so far.

Below is the link to download the interactive section:

https://ai.lange-analytics.com/exc/?file=15-TreeBasedExerc100.Rmd

In the interactive project below, you can change the argument in the set.seed() command to create different training and testing data and observe how the confusion matrix based on the testing data and the related metrics change. You can also observe how the Decision Tree structure changes because, with the code block below, you will generate a graphical representation for the Decision Trees you generate.

To create new Decision Trees, you only need to change the argument in the set.seed() command and execute the code. We recommend the following procedure:

Step 1: Leave the set.seed(777) as it is and execute the code.
Step 2: Take a screenshot or use your phone to take a photo of the confusion matrix, the metrics for the testing data, and the graphical representation of the structure of the Decision Tree.
Step 3: Change the value for set.seed() to whatever value you like, execute the code again, and go to Step 2.

Repeat Step 2 and Step 3 as often as you wish and save the results. When done, compare the results saved as screenshots or photos. You will likely find that the confusion matrices and the metrics did not change much, while the structure of the Decision Trees changed considerably.

set.seed(777)
Split7525=initial_split(DataTitanic, strata=Survived)
DataTrain=training(Split7525)
DataTest=testing(Split7525)

RecipeTitanic=recipe(Survived~., data=DataTrain)

ModelDesignDecTree=decision_tree(tree_depth=3) |> 
                    set_engine("rpart") |> 
                    set_mode("classification")

WfModelTitanic=workflow() |> 
               add_model(ModelDesignDecTree) |> 
               add_recipe(RecipeTitanic) |> 
               fit(DataTrain)

DataTestWithPred=augment(WfModelTitanic, new_data=DataTest)

# For better readability, we exchange the positive and 
# negative classes, making TRUE the positive class.
library(tidyverse)
DataTestWithPred=mutate(DataTestWithPred,  
                   Survived=fct_relevel(Survived,"TRUE","FALSE"),
                   .pred_class=fct_relevel(.pred_class,"TRUE","FALSE"))

library(rpart.plot)
rpart.plot(extract_fit_engine(WfModelTitanic), 
           yes.text="YES", no.text="NO",roundint=FALSE)

conf_mat(DataTestWithPred, truth=Survived, estimate=.pred_class)
metricSetTitanic=metric_set(accuracy, sensitivity, specificity)
metricSetTitanic(DataTestWithPred, truth=Survived, estimate=.pred_class)

To summarize, interpreting Decision Trees is easy and straightforward. However, the fact that the structure of Decision Trees reacts so sensitively to small changes is a major obstacle to using Decision Trees for solving real-world problems. Imagine an irresponsible researcher who changes the set.seed() value until the structure of the Decision Tree reflects what they would like to see.

Why did we cover Decision Trees in this chapter if their interpretability is flawed and their predictive quality, although not bad, can be exceeded by other machine learning models?

The reason is that combining several Decision Trees gives us a very strong predictive quality. The combination of machine learning models of the same or different types is called an ensemble.

Combining many Decision Trees into one ensemble is what Random Forest and Boosting Trees models do. We will cover these ensemble models in the following sections.

10.4 Random Forest

A Random Forest is an ensemble model that can be used for classification and regression tasks. A Random Forest consists of many (sometimes hundreds or thousands) slightly different Decision Trees. When these Decision Trees are created, some randomness is involved. Hence the name Random Forest.

Since the Decision Trees in a Random Forest are all slightly different, they produce slightly different predictions. The prediction of a Random Forest model is calculated as the aggregation of all the predictions from the various Decision Trees:

In the case of regression, the prediction from a Random Forest is calculated as the mean of the predictions of its Decision Trees.
In the case of binary classification, the class that is predicted by the majority of Decision Trees will be the predicted class of the Random Forest. Sometimes, this is called “the vote of the Decision Trees” for a specific class.
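
The two aggregation rules can be illustrated with a few lines of base R. This is only a minimal sketch: the predictions of the five Decision Trees below are made-up numbers for a single observation, and the object names are hypothetical.

```r
# Hypothetical predictions from five Decision Trees for ONE observation

# Regression: the forest prediction is the mean of the tree predictions
RegPredTrees=c(0.52, 0.47, 0.55, 0.50, 0.46)
PredForestReg=mean(RegPredTrees)

# Binary classification: the forest predicts the class with the most votes
ClassPredTrees=c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
Votes=table(ClassPredTrees)
PredForestClass=names(Votes)[which.max(Votes)]

print(PredForestReg)
print(Votes)
print(PredForestClass)
```

In the classification example, three of the five trees vote for "TRUE", so the forest predicts "TRUE".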

10.4.1 The Idea Behind Random Forest

Although each individual Decision Tree inside the Random Forest is not well suited for prediction (weak learners), the idea is that the aggregation of Decision Trees leads to good predictive quality.

The idea that a combination of weak learners can lead to a strong prediction is analogous to the Wisdom of Crowds phenomenon described in Galton (1907):

Visitors at a stock and poultry exhibition in England submitted guesses about the weight of an ox. Although most of the visitors were off with their predictions, surprisingly, the mean of all predictions was very close to the actual weight of the ox.

In order to generate different predictions, the Decision Trees in a Random Forest must be diverse, similar to the diverse visitors of Galton’s stock and poultry exhibition. The question is:

How do Random Forest models ensure their Decision Trees are all (slightly) different?

Two strategies are employed by the Random Forest model that ensure a diverse set of Decision Trees:

Random Subspace Method: When a Decision Tree is created (i.e., when the decision rules for the tree are determined), the Random Forest model does not consider all predictor variables. Instead, it uses only a random subset of predictor variables as candidates for the best splitting rule. This limits the predictive quality of individual trees but increases diversity among trees. The number of candidate predictor variables in tidymodels is by default \(\sqrt{M}\) (rounded down if needed), where \(M\) denotes the number of predictors in the model. So, for example, in a case where we have seven predictors, only two randomly chosen predictors are considered (\(\sqrt{7}\approx 2.65\); rounded down to \(2\)). In the ranger implementation used later in this chapter, this random selection is repeated for every split rather than once per tree. The number of candidate predictors is a hyper-parameter (mtry in tidymodels) and can be either set or tuned.
Bagging: Every time a new Decision Tree is created, the Random Forest model uses a different training dataset. These different training datasets are derived from the original dataset by drawing observations with replacement (!) until the new training dataset contains the same number of observations as the original dataset. This procedure is called Bootstrapping.
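
The random selection of candidate predictors can be sketched in base R. The predictor names X1 to X7 are hypothetical placeholders, and the code only mimics the selection step; it is not how any particular Random Forest package is implemented.

```r
set.seed(123)

# Seven hypothetical predictor variables
Predictors=paste0("X", 1:7)
M=length(Predictors)

# Default number of candidates: square root of M, rounded down
NumCandidates=floor(sqrt(M))

# Two independent random selections of candidate predictors
CandidatesA=sample(Predictors, size=NumCandidates)
CandidatesB=sample(Predictors, size=NumCandidates)

print(NumCandidates)
print(CandidatesA)
print(CandidatesB)
```

Because each selection is random, different selections usually contain different predictors, which is what makes the resulting Decision Trees diverse.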

Bootstrapping and Out-of-Bag Data

The three tables below show a training dataset and two Bootstrap samples derived from it:

Suppose the two Bootstrap samples are used to build Decision Tree 1 and Decision Tree 2, respectively. You can see that Bootstrap Sample 1 contains the observations for Bertha, Carlos, and Gert twice. At the same time, the observations for Adam, Dora, and Ernst are not included in this dataset. The observations for Adam, Dora, and Ernst are consequently not used to train Decision Tree 1. Since these observations are not in the bag of the training data for Decision Tree 1, they are called Out-of-Bag data.

Out-of-Bag (OOB) data are very useful because they can be used for validation purposes. After all, they were never used to train the related Decision Tree. The number of OOB observations is also reasonably large because the expected value for the number of observations falling in the OOB dataset is about \(1/3\) of the training dataset.
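
A short base R simulation can make Bootstrapping and the size of the OOB dataset more tangible. This is a sketch with a hypothetical dataset of 1,000 observation IDs. The probability that a specific observation is never drawn in \(N\) draws with replacement is \((1-1/N)^N\), which approaches \(1/e\approx 0.37\) for large \(N\), i.e., roughly one third.

```r
set.seed(2021)

# Hypothetical training dataset represented by its observation IDs
N=1000
ObsId=1:N

# Bootstrap sample: draw N observations WITH replacement
BootSample=sample(ObsId, size=N, replace=TRUE)

# Out-of-Bag observations: those that were never drawn
OobId=setdiff(ObsId, BootSample)
ShareOob=length(OobId)/N

# Theoretical probability that an observation is never drawn
TheoryOob=(1-1/N)^N

print(ShareOob)
print(TheoryOob)
```

The simulated share varies a little from one Bootstrap sample to the next, but it should stay close to the theoretical value.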

As an exercise, try to find the OOB data for Decision Tree 2.⁸⁴

10.4.2 Predicting Vaccination Behavior with Random Forest

Now it is time to showcase an application for Random Forest. In this section, we will use a Random Forest model to predict the U.S. vaccination rates during the COVID-19 pandemic. The analysis is based on research conducted by the author and his co-author, published in October 2022.⁸⁵

The data are from September 2021 and the outcome variable is the proportion of fully vaccinated (two shots) residents in 2,630 continental U.S. counties \((PercVacFull)\).⁸⁶

The following predictor variables were used:

Race/Ethnicity: Proportion of African Americans (\(PercBlack\)), Asian Americans (\(PercAsian\)), and Hispanics (\(PercHisp\)) for the county.⁸⁷

Political Affiliation: The proportion of voters who voted for the Republican presidential candidate (\(PercRep\)).⁸⁸ Since only Republican and Democratic votes were considered, the proportion of voters who voted for the Democratic presidential candidate equals (\(1- PercRep\)).

Age Groups: Proportion of young adults (20 – 25 years; \(PercYoung25\)) and proportion of older adults (65 years and older; \(PercOld65\)).⁸⁹

Income-related: To control for income effects, we used the county’s proportion of households receiving food stamps (\(PercFoodSt\)).⁹⁰

In the code block below, we load the data, select the variables to use, and split the observations into training and testing data:

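library(rio); library(tidymodels)
# Load the county data; keep outcome, predictors, and Population
DataVax=import("https://ai.lange-analytics.com/data/DataVax.rds") |>   
        select(PercVacFull, PercRep,
               PercAsian, PercBlack, PercHisp,
               PercYoung25, PercOld65, 
               PercFoodSt, Population) |> 
        # Mark Population as frequency weight (neither outcome nor predictor)
        mutate(Population=frequency_weights(Population))

# Split into 85% training and 15% testing data, stratified by the outcome
set.seed(2021)
Split85=initial_split(DataVax, prop=0.85, strata=PercVacFull, breaks=3)

DataTrain=training(Split85)
DataTest=testing(Split85)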

Above, you can see that \(Population\) (i.e., the population of the related U.S. county)⁹¹ was selected as one of the variables, although \(Population\) is neither an outcome nor a predictor variable for this research. \(Population\) will be used later in the workflow model to weigh the observations. To mark the variable \(Population\) as being neither outcome nor predictor variable, we used the command frequency_weights().

The recipe() in the code block below is the same as in Section 10.3.2 except that now \(PercVacFull\) is chosen as the outcome variable.

RecipeVax=recipe(PercVacFull~., data=DataTrain)

The dot in the argument PercVacFull~. indicates that all variables in DataTrain except the outcome, \(PercVacFull\), should be used as predictor variables. However, the variable \(Population\) is also excluded because we marked it as a frequency weight when we loaded the data and selected the variables.

In the code block below, we define the model design and use the rand_forest() command to choose a Random Forest model from the R ranger package (see set_engine()). To keep things simple, we set the hyper-parameters rather than tuning them:

library(parallel)
ModelDesignRandFor=rand_forest(min_n=5, mtry=2, trees=2000) |> 
                    set_engine("ranger",  num.threads=detectCores()) |> 
                    set_mode("regression")

We set min_n=5, which means that at least five observations are required in a node of a Decision Tree to allow a split and to create two new child nodes. The hyper-parameter mtry determines the number of randomly chosen variables as candidates for a split in the Decision Trees. It is set to mtry=2. Both settings coincide with the defaults for Random Forest.⁹²

We increased the number of Decision Trees used for the Random Forest from the default (trees=500) to trees=2000. In contrast to the hyper-parameters min_n and mtry, a high number of trees in a Random Forest cannot cause overfitting.

The last argument we provide in the code block above is set in the set_engine() command. The ranger engine used here allows us to execute R code in parallel on multiple computer cores to reduce computing time. Random Forest is well suited for parallel computing because the various Decision Trees can be developed independently from each other in any order and afterward combined into a Random Forest. We set the argument num.threads (the number of threads that run in parallel) to be equal to the number of computer cores of the executing computer (num.threads=detectCores()). This worked well on a computer with 16 logical cores.

In the code block below, we create the workflow and add the recipe and the model design. We also add the variable \(Population\) with add_case_weights(Population) to weigh the observations with their respective county’s population:

set.seed(2021)
WfModelVax=workflow() |>                             
           add_recipe(RecipeVax) |> 
           add_model(ModelDesignRandFor) |> 
           add_case_weights(Population) |> 
           fit(DataTrain)

Weighting the county observations with their population is needed because the U.S. counties have very different population sizes. Using unweighted observations would implicitly assign the same weight to counties with a few hundred residents as to counties with millions of residents, which would not be reasonable.
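
To illustrate why the weights matter, the base R sketch below compares an unweighted and a population-weighted mean vaccination rate for three made-up counties. The numbers are hypothetical, and the example only shows the general effect of weighting; it does not show how the ranger engine applies case weights internally.

```r
# Three hypothetical counties with very different population sizes
Counties=data.frame(Population=c(500, 20000, 3000000),
                    PercVacFull=c(0.30, 0.45, 0.65))

# Unweighted: every county counts the same
UnweightedMean=mean(Counties$PercVacFull)

# Weighted: every resident counts the same
WeightedMean=weighted.mean(Counties$PercVacFull, w=Counties$Population)

print(UnweightedMean)
print(WeightedMean)
```

The weighted mean is dominated by the large county, while the unweighted mean treats the county with 500 residents like the county with three million residents.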

When printing WfModelVax, you can see the setup of the fitted workflow and, in addition, how it performed on the OOB dataset.

## ══ Workflow [trained] ═══════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ─────────────────────
## 0 Recipe Steps
## 
## ── Case Weights ─────────────────────
## Population
## 
## ── Model ────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2,   
## 
## Type:                             Regression 
## Number of trees:                  2000 
## Sample size:                      2234 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01218 
## R squared (OOB):                  0.4281

The performance on the OOB validation data is pretty good, but the real challenge is the testing data. To see how the model performs on the testing data, we have to augment the testing data with the predictions and then use the metrics() command to generate and print the performance metrics:

DataTestWithPred=augment(WfModelVax, new_data=DataTest)
metrics(DataTestWithPred, truth=PercVacFull, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.109 
## 2 rsq     standard      0.458 
## 3 mae     standard      0.0753

On average, the Random Forest model over- or underestimates the counties’ vaccination rates by about 7.5 percentage points (\(mae=0.0753\)).
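
For readers who want to see what is behind the metrics mae and rmse, the base R sketch below calculates both by hand for four made-up observations. The values in Truth and Pred are hypothetical and unrelated to the vaccination data.

```r
# Hypothetical true vaccination rates and predictions for four counties
Truth=c(0.50, 0.62, 0.41, 0.70)
Pred=c(0.55, 0.60, 0.50, 0.58)

Errors=Truth-Pred

# Mean absolute error: average size of the errors, ignoring their sign
Mae=mean(abs(Errors))

# Root mean squared error: penalizes large errors more strongly
Rmse=sqrt(mean(Errors^2))

print(Mae)
print(Rmse)
```

Because squaring gives large errors more influence, the rmse is never smaller than the mae, which matches the pattern in the output above.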

In the next section, 10.5, we will introduce some Boosting Trees algorithms. Boosting Trees algorithms are more advanced than Random Forest. They run faster, and their predictive performance is similar to, and sometimes better than, the performance of Random Forest.

10.5 Boosting Trees Algorithms

As mentioned above, Boosting Trees algorithms are an improvement of Random Forest. They are based on many Decision Trees like Random Forest. However, the underlying Decision Trees are not randomly modified as in Random Forest. Instead, they are boosted by weighing the training data used for each Decision Tree or by changing how the Decision Trees are created and combined. Examples of Boosting Trees algorithms are:

AdaBoost: AdaBoost is an ensemble algorithm like Random Forest. The AdaBoost algorithm uses only tree stumps (Decision Trees with a tree depth of one level) for its trees. It starts by using the outcome and predictor variables from the training dataset to create the first tree.⁹³ Then, for the second and following Decision Trees, the errors of the previous Decision Tree are used as outcome variables. When these Decision Trees are combined to generate a prediction, they are weighted according to their predictive quality. Higher weight is assigned to trees with better predictive quality. This weighting of the trees is in contrast to a Random Forest, which uses an unweighted average for combining the predictions from its Decision Trees.

We mention AdaBoost here because it was one of the earliest Boosting Trees algorithms. However, we will not go into more detail because, meanwhile, more powerful Boosting Trees algorithms have been developed.⁹⁴

Gradient Boosting: This is a very powerful algorithm. Gradient Boosting starts with creating a first prediction based on the mean of the training dataset’s outcome variable. To create the first tree, it uses the errors from the first prediction as the outcome variable and the variables from the training data as predictor variables. Afterward, it combines the initial prediction and the prediction from the first Decision Tree. Then, based on the resulting new prediction errors, it creates a second Decision Tree. Afterward, based on the errors from this prediction, it creates a third tree. This process continues until a predefined number of trees is created or another stopping criterion is reached.

We will cover Gradient Boosting in more detail in Section 10.5.1.⁹⁵

XGBoost: The XGBoost algorithm⁹⁶ is a variation of Gradient Boosting. The major difference is that it is optimized for performance as it supports parallelization to use multiple computer cores, and it supports distributed computing to run the algorithm simultaneously on a cluster of computers. This makes XGBoost significantly faster than regular Gradient Boosting, especially for large datasets. In addition, XGBoost penalizes complex models, which helps to avoid overfitting.

You will use the tidymodels implementation of XGBoost in the interactive Section 10.5.2 to predict vaccination behavior with the same dataset that was used in the previous section with Random Forest.⁹⁷

LightGBM and CatBoost: These two algorithms are mentioned here because they are improvements of XGBoost in terms of computer processing time and the data volume the algorithms can handle, but they will not be covered further in this book.⁹⁸

Both algorithms are currently not directly available through tidymodels, but they can be used in connection with the treesnip package.

10.5.1 The Idea Behind Gradient Boosting

As mentioned above, Gradient Boosting builds Decision Trees based on prediction errors. Since no prediction errors are available at the starting point when building the Gradient Boosting ensemble, an initial prediction is needed. Gradient Boosting starts with the mean of the outcome variable as the prediction for all observations. Afterward, it uses the errors from previous predictions to build new trees to enrich the ensemble.

To explain the process in more detail, let us see step-by-step how Gradient Boosting works. To keep it simple, we limit the ensemble to only three Decision Trees. In reality, Gradient Boosting works with many more Decision Trees.⁹⁹

Let us assume we want to create an ensemble with Decision Trees \(D_1^{ecTree}\), \(D_2^{ecTree}\), and \(D_3^{ecTree}\). The goal is to predict an outcome variable \(Y_i\).

As mentioned above, the initial prediction is only based on the mean of the outcome variable from the training dataset:

\[\begin{equation} \widehat{Y}_i=\overline{Y}:=\frac{1}{N}\sum_{i=1}^N Y_i \tag{10.2} \end{equation}\]

Note that Equation (10.2) implies that the predictions for all observations are the same because the mean of the outcome variable is a single number. This is reasonable, but Equation (10.2) represents a very weak learner, and thus we have to expect large prediction errors \((u_{0,i})\). We can calculate these errors for all training observations because we know the true outcome for each training observation:

\[\begin{equation} u_{0,i}=Y_i - \overline{Y} \tag{10.3} \end{equation}\]

Note that the index \(0\) in \(u_{0,i}\) indicates that the error for observation \(i\) is related to the starting prediction in the ensemble. Equation (10.3) can also be written as Equation (10.4), which states that at the initial stage, the known outcomes from the training dataset \(Y_i\) consist of the predicted outcome \((\overline{Y})\) and the related errors \((u_{0,i})\):

\[\begin{equation} Y_i=\overline{Y}+u_{0,i} \tag{10.4} \end{equation}\]

Because we used a weak learner, the errors, \(u_{0,i}\), most likely contain systematic impacts on the outcome. To integrate these systematic impacts, we can create a Decision Tree \((D_1^{ecTree}(Obs_i))\) that uses the known errors (see Equation (10.3)) as values for the outcome variable and variables from the training dataset as predictor variables \((Obs_i)\). Predicting errors seems to be a little odd, but please bear with us. It will make perfect sense very soon.

Below is the prediction equation for the initial errors based on the first Decision Tree:

\[\begin{equation} \widehat{u}_{0,i}=D_1^{ecTree}(Obs_i) \tag{10.5} \end{equation}\]

Since the Decision Tree in Equation (10.5) does not predict the initial errors \((u_{0,i})\) perfectly, it will also create its own prediction errors \((u_{1,i})\). Consequently, the true initial errors consist of the predictions from the first Decision Tree and the errors related to these predictions:

\[\begin{equation} u_{0,i}=\underbrace{D_1^{ecTree}(Obs_i)}_{\widehat{u}_{0,i}}+u_{1,i} \tag{10.6} \end{equation}\]

Substituting Equation (10.6) into Equation (10.4) gives us:

\[\begin{equation} Y_i=\underbrace{\overline{Y}+D_1^{ecTree}(Obs_i)}_{\widehat{Y}_i}+u_{1,i} \tag{10.7} \end{equation}\]

Now, you can see why predicting errors makes sense. Equation (10.7) shows that the predictions \(\widehat{Y}_i\) improved. It is not only based on the mean of the outcome \((\overline{Y})\) but also on the first Decision Tree’s prediction \((D_1^{ecTree}(Obs_i))\).

Gradient Boosting uses a slight modification from the prediction in Equation (10.7). Gradient Boosting predicts the outcome at the stage of the first Decision Tree as:

\[\begin{equation} Y_i=\underbrace{\overline{Y}+\gamma D_1^{ecTree}(Obs_i)}_{\widehat{Y}_i}+u_{1,i} \quad \mbox{with: } 0<\gamma<1 \tag{10.8} \end{equation}\]

Equation (10.8) is identical to Equation (10.7), except that the influence of the Decision Tree on the prediction is weakened by multiplying with the learning rate \(\gamma\). The learning rate \(\gamma\) is a tuneable hyper-parameter. It is usually set to values considerably smaller than one. For example, the Gradient Boosting algorithm XGBoost in tidymodels uses \(\gamma=0.3\) as the default learning rate.

Since in Equation (10.8) \(\gamma\) weakens the influence of the first and only Decision Tree, it is reasonable to assume that the related errors \((u_{1,i})\) still contain some systematic impacts on the outcome variable \(Y\). Therefore, we create a second Decision Tree \((D_2^{ecTree}(Obs_i))\) based on the predictor variables in the training dataset and the known errors from the first Decision Tree. These errors are known because we can calculate them as the difference between the known outcome values from the training dataset and the predictions from Equation (10.8):

\[\begin{equation} u_{1,i} =Y_i-\underbrace{\left(\overline{Y}+\gamma D_1^{ecTree}(Obs_i)\right)}_{\widehat{Y}_i} \quad\mbox{with: } 0<\gamma<1 \nonumber \end{equation}\]

After creating the second Decision Tree \((D_2^{ecTree}(Obs_i))\) to predict the errors from the first Decision Tree \((u_{1,i})\), the prediction equation for the outcome \(Y\) improves to:

\[\begin{equation} Y_i=\underbrace{\overline{Y}+\gamma D_1^{ecTree}(Obs_i)+\gamma D_2^{ecTree}(Obs_i)}_{\widehat{Y}_i}+u_{2,i} \quad\mbox{ with: } 0<\gamma<1 \nonumber \end{equation}\]

The errors from the prediction by the second Decision Tree \((u_{2,i})\) may still contain some systematic information, and since their values are known, we can use a third Decision Tree to integrate the systematic impacts from these errors, which leads to:

\[\begin{eqnarray} Y_i&=&\underbrace{\overline{Y}+\gamma D_1^{ecTree}(Obs_i)+\gamma D_2^{ecTree}(Obs_i)+\gamma D_3^{ecTree}(Obs_i)}_{\widehat{Y}_i}+u_{3,i} \nonumber \\ && \mbox{with: } 0<\gamma<1 \nonumber \\ && \nonumber \\ \widehat{Y}_i &=& \overline{Y}+\gamma D_1^{ecTree}(Obs_i)+\gamma D_2^{ecTree}(Obs_i)+\gamma D_3^{ecTree}(Obs_i) \tag{10.9} \\ && \mbox{with: } 0<\gamma<1 \nonumber \end{eqnarray}\]

We stop here since we decided to use only three Decision Trees. In real-world applications, many more trees would be added. In fact, the number of Decision Trees to be added is a tuneable hyper-parameter for Gradient Boosting models.

Equation (10.9) can be used to predict the outcome \(Y_i\) for any observation \(i\) as long as the values for the predictor variables \((Obs_i)\) are known. For example, we could use the observations from a testing dataset to assess how well a Gradient Boosting model performs.
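
The steps leading to Equation (10.9) can be sketched in base R. The code below is a simplified illustration under several assumptions: the training data are simulated, there is only one predictor variable, and each Decision Tree is a hand-written tree stump (the hypothetical helper functions FitStump() and PredictStump()). It is not the implementation used by XGBoost or any other package.

```r
# Fit a tree stump: find the threshold on x that minimizes the squared error
FitStump=function(x, u) {
  Candidates=sort(unique(x))
  # Possible thresholds lie halfway between neighboring x values
  Thresholds=(head(Candidates, -1)+tail(Candidates, -1))/2
  BestSse=Inf
  Best=NULL
  for (Thr in Thresholds) {
    Left=mean(u[x<=Thr])
    Right=mean(u[x>Thr])
    Sse=sum((u-ifelse(x<=Thr, Left, Right))^2)
    if (Sse<BestSse) {
      BestSse=Sse
      Best=list(Threshold=Thr, Left=Left, Right=Right)
    }
  }
  Best
}

# Predict with a fitted tree stump
PredictStump=function(Stump, x) {
  ifelse(x<=Stump$Threshold, Stump$Left, Stump$Right)
}

# Simulated training data with a non-linear relationship
set.seed(2021)
x=runif(200, min=0, max=10)
Y=sin(x)+rnorm(200, sd=0.2)

Gamma=0.3     # learning rate
NumTrees=3    # three Decision Trees as in the text

# Initial prediction: mean of the outcome (Equation (10.2))
PredY=rep(mean(Y), length(Y))
MseByStage=mean((Y-PredY)^2)

Stumps=list()
for (k in 1:NumTrees) {
  u=Y-PredY                       # errors of the current prediction
  Stumps[[k]]=FitStump(x, u)      # next tree predicts these errors
  PredY=PredY+Gamma*PredictStump(Stumps[[k]], x)
  MseByStage=c(MseByStage, mean((Y-PredY)^2))
}

# Prediction for new observations as in Equation (10.9)
PredictBoost=function(xNew) {
  Pred=rep(mean(Y), length(xNew))
  for (Stump in Stumps) Pred=Pred+Gamma*PredictStump(Stump, xNew)
  Pred
}

print(MseByStage)
print(PredictBoost(c(1.5, 5, 8)))
```

The vector MseByStage holds the training error of the initial prediction followed by the training error after each of the three trees. Since every tree is fitted to the remaining errors, the training error should fall from stage to stage.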

Equation (10.9) also allows us to explain why weakening the influence of the Decision Trees through the learning rate \(\gamma\) makes sense. Assume we had a prediction equation such as Equation (10.9) without a learning rate \(\gamma\) but with many more Decision Trees. The errors of each Decision Tree would be corrected by the following Decision Tree, and this correction would not be artificially weakened. Such an ongoing unregulated correction could lead to a severe overfitting problem because the prediction equation could approximate the training data almost perfectly.

10.5.2 🧭Using XGBoost to Predict Vaccination Rates

Interactive Section

In this section, you will find content together with R code to execute, change, and rerun in RStudio.

To work with this section in RStudio in an interactive environment, follow these steps:

Ensure that both the learnr and the shiny packages are installed. If not, install them from RStudio’s main menu (Tools -> Install Packages \(\dots\)).
Download the Rmd file for the interactive session and save it in your project folder. You will find the link for the download below.
Open the downloaded file in RStudio and click the Run Document button, located in the editing window’s top-middle area.

For detailed help for running the exercises including videos for Windows and Mac users we refer to: https://blog.lange-analytics.com/2024/01/interactsessions.html

Do not skip this interactive section because besides providing applications of already covered concepts, it will also extend what you have learned so far.

Below is the link to download the interactive section:

https://ai.lange-analytics.com/exc/?file=15-TreeBasedExerc200.Rmd

In this section, you will use a Gradient Boosting algorithm to predict vaccination rates in U.S. counties based on socioeconomic predictor variables. We will use the same data and predictor variables that we used in Section 10.4 (see Section 10.4.2 for details about the data).

To be precise, we will use XGBoost, which is available in the tidymodels package. XGBoost is based on Gradient Boosting, but it is a more advanced and a more effective type of Gradient Boosting. This is because XGBoost is optimized for parallel processing and thus can run simultaneously on different CPUs on your computer. Optionally, it can run in a distributed environment. That is, it can run on different computers at the same time. You will see how fast the XGBoost algorithm is when you run and tune an XGBoost machine learning model on your computer, but first, let us prepare the tuning.

We start with downloading the data, selecting the outcome, the predictor variables, and the variable we will use later to weigh our observations.

library(rio); library(tidymodels); library(xgboost)
DataVax=import("https://ai.lange-analytics.com/data/DataVax.rds") |>
        select(PercVacFull, PercRep,
               PercAsian, PercBlack, PercHisp,
               PercYoung25, PercOld65, 
               PercFoodSt, Population) |> 
               mutate(Population=frequency_weights(Population))

Since we will tune some of the hyper-parameters of the XGBoost algorithm, we again use the 10-Step Tuning Template from Section 6.6:

Step 1 - Generate Training and Testing Data:

The training and testing data are generated as follows:

set.seed(2021)
Split85=initial_split(DataVax, prop=0.85, strata=PercVacFull, 
                      breaks=3)

DataTrain=training(Split85)
DataTest=testing(Split85)

Step 2 - Create a Recipe:

The recipe below determines the outcome variable (PercVacFull) and chooses all other variables as predictor variables, which excludes \(Population\) because this variable was set to frequency_weights() when the data were loaded and the variables were selected:

RecipeVax=recipe(PercVacFull~., data=DataTrain)

Step 3 - Create a Model Design:

The model design below uses the boost_tree() command to choose a Gradient Boosting machine learning model. The algorithm XGBoost is selected in the set_engine() command with the argument xgboost. Since we predict a continuous variable (the vaccination rate), set_mode() is set to regression:

ModelDesignBoostTrees=boost_tree(trees=tune(), tree_depth=tune())|> 
                      set_engine("xgboost") |> 
                      set_mode("regression")

For the boost_tree() command, we set two hyper-parameters up for tuning. The hyper-parameter trees determines the number of Decision Trees, and the hyper-parameter tree_depth specifies how many levels each of these Decision Trees has.

Step 4 - Add the Recipe and the Model Design to a Workflow:

As before, we add the recipe and the model design to a workflow:

WfModelVax=workflow() |> 
           add_model(ModelDesignBoostTrees) |> 
           add_recipe(RecipeVax) |> 
           add_case_weights(Population)

Note the command add_case_weights(Population) at the end of the workflow model. It ensures that the observations (the U.S. counties) are weighted according to their population.

Steps 1 – 4 are already prepared for you and executed in the background. Below, you can execute the code block for Steps 5 – 10 from the 10-Step Tuning template.

In Step 5, the hyper-parameter values that are tried out are determined in the parameter grid. We encourage you to experiment with these values.

In Step 6, ten folds are chosen for Cross-Validation. You may change v=10 to a lower value to speed up the tuning process.

Step 7 tunes the workflow model. Note that the command doParallel::registerDoParallel() prepares the tuning for parallelization. It is essential to install the doParallel package before you tune. Otherwise, R will throw an error. After tuning is completed, the autoplot command will visualize the Cross-Validation performance for all chosen hyper-parameter combinations.

In Step 8, the hyper-parameters that performed best with the Cross-Validation folds are selected for use in the final model in Step 9.

This best model is then evaluated based on the testing data in Step 10.
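
Before tuning, it can help to check how much work the parameter grid implies. The base R sketch below rebuilds the grid from Step 5 and counts the hyper-parameter combinations. Multiplying by the number of folds gives a rough upper bound for the number of model fits; this is an assumption-based estimate because tidymodels may need fewer fits when predictions for smaller numbers of trees can be derived from one fitted model.

```r
# Same grid values as in Step 5
ParGrid=expand.grid(tree_depth=c(1, 2, 5, 10, 15), 
                    trees=c(5, 10, 15, 50, 100))

# Every tree_depth value is combined with every trees value
NumCombinations=nrow(ParGrid)

# Rough upper bound for the number of model fits with ten folds
NumFolds=10
NumModelFits=NumCombinations*NumFolds

print(head(ParGrid))
print(NumCombinations)
print(NumModelFits)
```

Adding more values to the grid or increasing the number of folds raises the computing time accordingly.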

After you load the code below into RStudio (see the info box at the beginning of this section), we recommend that you first run the document in a browser without any changes. Afterward, evaluate the diagram generated by autoplot(), change the hyper-parameter values in Step 5 accordingly, and execute the code block again.

# Step 5 - Create a Hyper-Parameter Grid
set.seed(2021)
ParGridVax=expand.grid(tree_depth=c(1, 2, 5, 10, 15), 
                       trees=c(5, 10, 15, 50, 100))
  
# Step 6 - Create Resamples for Cross-Validation:
FoldsVax=DataTrain |>
          vfold_cv(v=10, strata=PercVacFull, breaks=5)

# Step 7 - Tune the Workflow and Train All Models:
# Make sure the doParallel package is installed!!!
doParallel::registerDoParallel() 

set.seed(2021)
StartTime=Sys.time()
TuneResultsVax=tune_grid(WfModelVax,
                         resamples=FoldsVax,
                         grid=ParGridVax,
                         metrics=metric_set(mae))
RunTime=Sys.time()- StartTime

# Visualize tuning results
print("TUNING RESULTS:")
autoplot(TuneResultsVax)

# Step 8 - Extract the Best Hyper-Parameter(s):
BestParVax=select_best(TuneResultsVax, metric="mae")


# Step 9 - Finalize and Train the Best Workflow Model:
set.seed(2021)
BestWFModelVax=finalize_workflow(WfModelVax, BestParVax) |> 
               fit(DataTrain)

# Step 10 - Assess Prediction Quality Based on the Testing Data:
DataTestWithPred=augment(BestWFModelVax, new_data=DataTest)
MetricsBestModel=metrics(DataTestWithPred, truth=PercVacFull, 
                         estimate=.pred)

# Print computation time
print("TUNING TIME:") 
print(RunTime)

# Best parameters from tuning
print("BEST PARAMETERS:")
print(BestParVax)

# Print metrics for best model
print("TESTING DATA METRICS BEST MODEL:")
print(MetricsBestModel)

When you execute the code block above unchanged in RStudio, your results should be similar to the ones below.

Figure 10.3 shows a visualization for the predictive performance for all tried out hyper-parameter values based on Cross-Validation.

FIGURE 10.3: Tuning the Tree Depth and the Number of Trees

You can see that the mae (mean absolute error) was lowest for a tree_depth value of two in combination with \(50\) Decision Trees. This makes sense. A Decision Tree with only two levels is a very weak learner. Therefore, even after processing the errors through many trees, some systematic information was left over so that even Decision Tree number \(50\) had something to contribute.

You can see that tree_depth values of 10 and 15 did not perform very well. Maybe you should try a few tree_depth values below or slightly above five.

The situation is not as clear for the number of trees. It looks as if using significantly more than \(50\) trees would not lower the error considerably. However, for Decision Trees with only one level (also called tree stumps), it might make sense to increase the number of trees to more than \(100\). This might lower the mae even below the current minimum because the red line in Figure 10.3 indicates a falling mae for tree_depth=1.

10.6 When and When Not to Use Tree-Based Models

Decision Trees have a high educational value because the graphical representation provides an intuitive way to see which variables influenced the predictions.
As a standalone model, Decision Trees should not be used for solving real-world problems. The reasons are:
- Although most decision rules are reasonably interpretable, some decision rules might not make sense. This is a serious drawback of the otherwise good interpretability of Decision Trees.
- The structure of Decision Trees responds very sensitively to a change in hyper-parameters such as the tree depth or minor changes in the data. With an unstable tree structure, interpretation becomes challenging and is not credible.
- Decision Trees are weak learners in the sense that other machine learning models often perform better.
Although Decision Trees are weak learners, combining many Decision Trees into an ensemble model such as Random Forest or Boosting Trees models often leads to excellent predictive results.
Random Forest models are a good choice for regression and classification tasks. This is especially true when non-linearity or predictor variable interactions are suspected in the underlying data. Generally, using a linear OLS model as a benchmark is always a good idea to evaluate if the data present a linear process that can be addressed with OLS.
For very complex models with many predictor variables, Deep Learning models such as Neural Networks may be a better choice. These models should be run in addition to Random Forest to evaluate whether they perform better.
If computing time is an issue when running a Random Forest model, Boosting Trees algorithms such as XGBoost might be the better choice. XGBoost typically runs faster than Random Forest, and its predictive quality is similar to, and often better than, that of Random Forest.

10.7 Digital Resources

Below you will find a few digital resources related to this chapter such as:

Videos
Short articles
Tutorials
R scripts

These resources are recommended if you would like to review the chapter from a different angle or to go beyond what was covered in the chapter.

Here we show only a few of the digital resources. At the end of the list you will find a link to additional digital resources for this chapter that are maintained on the Internet.

You can find a complete list of digital resources for all book chapters on the companion website: https://ai.lange-analytics.com/digitalresources.html

10.7.1 Decision Trees

Decision and Classification Trees, Clearly Explained!!! from StatQuest by Josh Starmer

The video introduces the basics of Decision Trees for classification. It is a good video to start learning about Decision Trees.

Link: https://ai.lange-analytics.com/dr?a=407

Regression Trees, Clearly Explained!!! from StatQuest by Josh Starmer

The video introduces the basics of Decision Trees used for regression. We recommend watching the StatQuest video about Decision Trees for classification first.

Link: https://ai.lange-analytics.com/dr?a=403

Decision Trees in Machine Learning Using R from DataCamp by James Le and Arunn Thevapalan

This is a free tutorial from DataCamp about Decision Trees. The provided R code shows how to build a Decision Tree model using tidymodels.

Link: https://ai.lange-analytics.com/dr?a=408

More Digital Resources

Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

Link: https://ai.lange-analytics.com/dr/dectrees.html

10.7.2 Random Forest

Random Forests Part 1 - Building, Using and Evaluating from StatQuest by Josh Starmer

This video is Part 1 of a video series about Random Forest. The video explains the basics of Random Forest, including an introduction to Bootstrapping.

Link: https://ai.lange-analytics.com/dr?a=409

Tuning Hyper Parameters in a Random Forest Model

This blog post by Carsten Lange shows how the hyper-parameters of a Random Forest model that estimates COVID-19 vaccination rates in 2021 can be tuned using the 10-Step Tuning template.

Link: https://ai.lange-analytics.com/dr?a=353

More Digital Resources

Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

Link: https://ai.lange-analytics.com/dr/randforest.html

10.7.3 Boosting Trees Algorithms

Gradient Boosting Video from StatQuest

A video from StatQuest by Josh Starmer. The video is the first part of four videos about Gradient Boosting. This video focuses on the main ideas behind using Gradient Boosting to predict a continuous variable.

Link: https://ai.lange-analytics.com/dr?a=404

Tuning XGBoost with tidymodels

This blog post from tidyTuesday by Julia Silge describes how to use tidymodels to build and tune an XGBoost model. The goal is to predict the outcome of volleyball games. The blog post also provides a video and an R tutorial.

Link: https://ai.lange-analytics.com/dr?a=406

More Digital Resources

Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

Link: https://ai.lange-analytics.com/dr/boosttrees.html

References

Aden-Buie, Garrick, Barret Schloerke, and JJ Allaire. 2022. Learnr: Interactive Tutorials for r. https://CRAN.R-project.org/package=learnr.

Centers for Disease Control and Prevention (CDC). 2021a. “Vaccine Hesitancy for COVID-19: County and Local Estimate.” Online, June.

———. 2021b. “COVID-19 Vaccinations in the United States, County.” Online, September.

Chan, Chung-Hong, Geoffrey C. H. Chan, Thomas J. Leeper, and Jason Becker. 2021. Rio: A Swiss-Army Knife for Data File i/o.

Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, edited by Balaji Krishnapuram, 321–57. ACM, New York, NY. https://doi.org/10.1145/2939672.2939785.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2023. XGBoost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.

Corporation, Microsoft, and Steve Weston. 2022. Doparallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.

Delgado, Fernando. 2022. “A Beginner’s Guide to CatBoost with Python.” MLearning.ai, June. https://medium.com/mlearning-ai/a-beginners-guide-to-catboost-with-python-763d7e7ac199.

Esri (Environmental Systems Research Institute). 2023. “Geoenrichment.” Online. https://www.esri.com/en-us/arcgis/products/location-services/services/geoenrichment.

Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Freund, Yoav, and Robert E. Schapire. 1996. “Experiments with a New Boosting Algorithm.” In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, 148–56. ICML’96. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Galton, Francis. 1907. “Vox Populi.” Nature 75 (1949): 450–51. https://doi.org/10.1038/075450a0.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Kurama, Vihar. 2018. “A Guide to AdaBoost: Boosting to Save the Day.” Paperspace Blog, Series: Ensemble Methods.

Lange, Carsten, and Jian Lange. 2022. “Applying Machine Learning and AI Explanations to Analyze Vaccine Hesitancy.” arXiv, January.

Milborrow, Stephen. 2022. Rpart.plot: Plot ’Rpart’ Models: An Enhanced Version of ’Plot.rpart’. https://CRAN.R-project.org/package=rpart.plot.

Morde, Vishal. 2019. “XGBoost Algorithm: Long May She Reign!” Towards Data Science, April.

Park, A. et al. 2021. “Presidential Precinct Data for the 2020 General Election.” Edited by New York Times. New York Times, April. https://github.com/TheUpshot/presidential-precinct-map-2020.

Pramoditha, Rukshan. 2021. “Can LightGBM Outperform XGBoost? Boosting Algorithms in Machine Learning — Part 5.” Towards Data Science, October. https://towardsdatascience.com/can-lightgbm-outperform-xgboost-d05a94102a55.

R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sharma, Abhishek. 2020. “4 Simple Ways to Split a Decision Tree in Machine Learning.” Analytics Vidhya, June. https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/.

Singh, Himanshi. 2021. “How to Select Best Split in Decision Trees Using Gini Impurity.” Analytics Vidhya, March. https://www.analyticsvidhya.com/blog/2021/03/how-to-select-best-split-in-decision-trees-gini-impurity/.

Therneau, Terry, and Beth Atkinson. 2022. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.

Wikipedia contributors. 2023b. “Diversity Index.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Diversity_index&oldid=1189901595.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Yıldırım, Soner. 2020. “Gradient Boosted Decision Trees-Explained.” Towards Data Science, February. https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af.

Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

See Section 3.5.2 for more details about the Titanic dataset.↩︎
Note, tidymodels treats TRUE (survived) as the positive class.↩︎
Decision rules of the previous level are never reversed, even if this would allow for a better decision rule on the current level. There is no turning back to reverse a decision rule on a previous level. This type of algorithm is called a greedy algorithm.↩︎
There are a few shortcuts that avoid considering all values of a variable as splitting values, but this exceeds the scope of this book.↩︎
See Sharma (2020) for an intuitive description of the different standards.↩︎
Singh (2021) provides an intuitive introduction about Gini Impurity.↩︎
Gini Impurity is calculated for an individual node and estimates “(\(\dots\)) the probability that two entities taken at random from the dataset of interest (with replacement) represent (\(\dots\)) different types” (Wikipedia contributors (2023b)).↩︎
The OOB sample for Decision Tree 2 consists of Bertha and Gert.↩︎
See Lange and Lange (2022).↩︎
Source: Centers for Disease Control and Prevention (CDC) (2021b).↩︎
Source: Centers for Disease Control and Prevention (CDC) (2021a).↩︎
Source for the raw data: Park, A. et al. (2021). The authors of Lange and Lange (2022) calculated the proportions.↩︎
Source: Centers for Disease Control and Prevention (CDC) (2021a).↩︎
Source: Esri (Environmental Systems Research Institute) (2023). The author thanks ESRI for the permission to use their proprietary data for the interactive sections of this book.↩︎
Source: Centers for Disease Control and Prevention (CDC) (2021a).↩︎
min_n=5 is the default for regression problems, and the default for mtry is calculated as \(\sqrt{\text{Number of Predictors}}=\sqrt{7} \approx 2.65\), rounded down to \(2\), as we have seven predictor variables.↩︎
AdaBoost is based on a paper by Freund and Schapire (1996).↩︎
See Kurama (2018) for more details about AdaBoost.↩︎
See Yıldırım (2020) for an introduction to Gradient Boosting.↩︎
See Chen and Guestrin (2016).↩︎
See Morde (2019) for an introduction to XGBoost.↩︎
See Pramoditha (2021) for an introduction to LightGBM and Delgado (2022) for CatBoost.↩︎
For example, the default for the XGBoost algorithm we use in this section is \(15\) and can be tuned. Using \(100\) or more trees is not uncommon.↩︎