Chapter 6 Polynomial Regression — Overfitting/Tuning Explained

This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.


The importance of creating training and testing data, and how this can be done in R, was emphasized in the previous chapters. Splitting the data into training and testing data is necessary because you cannot use the same data to train (calibrate) a model’s parameters and assess a model’s predictive quality. If you did that, your model would be overfitted!

Overfitting is a situation where a fitted model approximates the training data well, but fails on new observations, which it has never seen before.

In this chapter, we first explain the problem of overfitting in more detail (see Section 6.3). Then in Section 6.4, we use a univariate Polynomial Model to demonstrate overfitting with an example. This model will predict house prices based on various powers of the predictor variable square footage (e.g., \(Sqft\), \(Sqft^2\), \(Sqft^3\), \(\dots\)).

Polynomial Model

A model that expresses the relationship between the predicted outcome and the predictor variable(s) as a polynomial function is called a Polynomial Model. A Polynomial Model involving one predictor variable is called a univariate Polynomial Model. In contrast, a Polynomial Model involving two or more predictor variables is called a multivariate Polynomial Model.

Example - Univariate Polynomial Model with predictor variable \(x\):

\[ \widehat{y}= \beta_1 x +\beta_2 x^2 +\beta_3 x^3 +\beta_4 \]

Example - Multivariate Polynomial Model with two predictor variables ( \(x_1\) and \(x_2\)):

\[ \widehat{y}= \beta_1 x_1 +\beta_2 x_1^2+ \beta_3 x_1 x_2+ \beta_4 x_2 + \beta_5 x_2^2+\beta_6 \]

The term \(x_1 x_2\) is called an interaction term. This is because the term \(x_1 x_2\) only has a high value (high influence on the outcome) if \(x_1\) and \(x_2\) have high values. For example, to predict wage the two numerical predictors skill level and years of experience could be modeled as an interaction term because it requires both skill and experience to earn a high wage.
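As a minimal sketch of how the multivariate example equation could be estimated in base R, the code below fits it to simulated data. The data, the object names (x1, x2, y, FitPoly), and the true coefficient values are assumptions made purely for illustration; they do not come from the book's datasets.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n  = 200
x1 = runif(n, 0, 5)
x2 = runif(n, 0, 5)
y  = 2 + 1.5 * x1 * x2 + rnorm(n)   # outcome driven mainly by the interaction

# I() protects arithmetic inside a formula; x1:x2 is the interaction term
FitPoly = lm(y ~ x1 + I(x1^2) + x1:x2 + x2 + I(x2^2))

# Six estimated parameters: five slopes plus the intercept
print(round(coef(FitPoly), 2))
```

Because the simulated outcome is high only when both x1 and x2 are high, the estimated coefficient for x1:x2 should land close to the value used in the simulation, while the other slopes should stay near zero.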

In what follows, we will work with a univariate Polynomial Model. We will use a very small training dataset to calibrate the model parameters. The small training dataset in connection with the Polynomial Model will lead to an overfitting scenario.

In Section 6.4, the overfitting scenario will allow us to visualize why overfitting can occur and why it is problematic for predicting new observations.

In Section 6.5, we will introduce hyper-parameter tuning, an iterative procedure that helps to avoid overfitting. Hyper-parameter tuning will help us to find the best model complexity (degree of the power for \(Sqft\)) for the Polynomial Model.

In Section 6.6 we will introduce a tuning template based on the R tidymodels package. This template will allow you to use the concept of hyper-parameter tuning with any machine learning model as long as it is supported by tidymodels.

In Section 6.7 you can apply the template from Section 6.6 to work on an interactive project. You will try to find the best value for the hyper-parameter \(k\) for the k-Nearest Neighbors model that was used in Chapter 4 to classify wines into red and white wines.

6.1 Learning Outcomes

This section outlines what you can expect to learn in this chapter. In addition, the corresponding section number is included for each learning outcome to help you to navigate the content, especially when you return to the chapter for review.

In this chapter, you will learn:

  • To identify under which circumstances overfitting likely occurs (see Section 6.3).

  • To apply a Polynomial Model to predict house prices (see Section 6.4).

  • How to explain overfitting in detail (see Section 6.4).

  • How overfitting can compromise the prediction quality for new data (see Section 6.4).

  • How to use hyper-parameter tuning to avoid overfitting (see Section 6.5).

  • How to work with the 10-Step Tuning Template to tune hyper-parameters for various types of machine learning models (see Section 6.6).

  • How to work with a real-world dataset and apply the tuning template from Section 6.6 to find the best value for the hyper-parameter \(k\) in a k-Nearest Neighbors model (see Section 6.7).

6.2 R Packages Required for the Chapter

This section lists the R packages that you need when you load and execute code in the interactive sections in RStudio. Please install the following packages using Tools -> Install Packages \(\dots\) from the RStudio menu bar (you can find more information about installing and loading packages in Section 3.4):

  • The rio package (Chan et al. (2021)) to enable the loading of various data formats with one import() command. Files can be loaded from the user’s hard drive or the Internet.

  • The janitor package (Firke (2023)) to rename variable names to UpperCamel and to substitute spaces and special characters in variable names.

  • The tidymodels package (Kuhn and Wickham (2020)) to streamline data engineering and machine learning tasks.

  • The kableExtra (Zhu (2021)) package to support the rendering of tables.

  • The learnr package (Aden-Buie, Schloerke, and Allaire (2022)), which is needed together with the shiny package (Chang et al. (2022)) for the interactive exercises in this book.

  • The shiny package (Chang et al. (2022)), which is needed together with the learnr package (Aden-Buie, Schloerke, and Allaire (2022)) for the interactive exercises in this book.

  • The kknn package (Schliep and Hechenbichler (2016)) to run k-Nearest Neighbors models.

6.3 The Problem of Overfitting

Machine learning aims to develop models that can be used to predict outcomes in the future. However, data scientists can only develop machine learning models based on data from the past (training data). They use the training data to calibrate a machine learning model’s parameters.

This approach is not without problems. When too successfully calibrating a machine learning model to the training data (i.e., the error based on the training data is very small), the model is extremely specialized in approximating the training data, but it fails with new observations that are not part of the training dataset. This is the core problem of overfitting.

Avoiding overfitting during model development by adjusting the model design (i.e., choosing hyper-parameter values) is not easy, because we cannot use the testing data to assess different designs. The testing data can only be used to evaluate the finalized model.

Without being able to use the testing data, a second-best approach to minimize overfitting is Cross-Validation. This procedure utilizes the training data and randomly chooses part of the training data as a holdout validation dataset to validate different model designs. This process is repeated on a rolling basis until each observation has been assigned once to the validation dataset (more about Cross-Validation in Section 6.5.3).

Circumstances that Can Lead to Overfitting

Identifying conditions that make overfitting more likely helps with developing strategies to avoid it. In general, overfitting is more likely to occur:

  1. When the training dataset does not have a sufficient number of observations.
  2. When the model considers many variables and consequently contains many parameters to calibrate.
  3. When the underlying machine learning model is highly non-linear.
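The three circumstances can be reproduced in a small simulation. The following base R sketch is an illustration under stated assumptions: the data are simulated with a truly linear relationship, and all object names (SimData, DataTrainSim, DataTestSim, Rmse, Results) are hypothetical. It fits polynomials of increasing degree to only 15 training observations and compares the training and testing errors.

```r
# Simulated data: the true relationship is linear (an assumption for illustration)
set.seed(42)
SimData = function(n) {
  x = runif(n, 0, 1)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.3))
}
DataTrainSim = SimData(15)     # very small training dataset
DataTestSim  = SimData(1000)   # large testing dataset

Rmse = function(truth, estimate) sqrt(mean((truth - estimate)^2))

Degrees = 1:8
Results = data.frame(Degree = Degrees, RmseTrain = NA_real_, RmseTest = NA_real_)
for (d in Degrees) {
  # More flexibility and more parameters with every additional degree
  Fit = lm(y ~ poly(x, d), data = DataTrainSim)
  Results$RmseTrain[d] = Rmse(DataTrainSim$y, predict(Fit, DataTrainSim))
  Results$RmseTest[d]  = Rmse(DataTestSim$y,  predict(Fit, DataTestSim))
}
print(Results)
```

The training error can only fall (or stay equal) as the degree rises, because each model contains the previous one as a special case. The testing error has no such guarantee and typically stops improving once the extra flexibility is spent on fitting noise.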

6.4 Demonstrating Overfitting with a Polynomial Model

To demonstrate how overfitting occurs and which problems result from overfitting, we will use a Polynomial Model to predict housing prices.

As in the interactive Section 5.5, we utilize the King County House Sale dataset (Kaggle (2015)) and split the data into training and testing data:

library(tidymodels); library(rio); library(janitor)
DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving)

set.seed(777)
Split001=DataHousing |> 
         initial_split(prop=0.001, strata=Price, breaks=5) 
DataTrain=training(Split001)
DataTest=testing(Split001)

Note that the argument prop=0.001 in the initial_split() command assigns only 20 observations to the training data. The remaining 21,593 observations will be used as testing data. The reason to consider only 20 observations for training, although enough observations are available to create a bigger training dataset, is that we will purposely create circumstances that can lead to an overfitting scenario. In real-world analysis, where we work with bigger training datasets, overfitting is often more subtle and difficult to identify.

We use the prediction equation below to estimate the price of a house based on living square footage \((Sqft)\). The model is polynomial because \(Sqft\) is considered with various powers:

\[\begin{equation} \widehat{Price}=\beta_1 Sqft+\beta_2 Sqft^2+\beta_3 Sqft^3+\beta_4 Sqft^4+\beta_5 Sqft^5+\beta_6 \tag{6.1} \end{equation}\]

Since the price of a house \((\widehat{Price})\) is estimated based only on its square footage, the model is classified as univariate. A univariate model was chosen because it will allow us to present the results in a 2D diagram with the price on the vertical and square footage on the horizontal axis.

As you see in Equation (6.1), the predictor variable \(Sqft\) is used with powers ranging from \(1\) to \(5\), which makes the model non-linear.

If you look at the circumstances that likely lead to overfitting in the info box at the end of the previous section, you can see that all conditions are fulfilled here:

  1. The training dataset does not have a sufficient number of observations (\(20\) observations in our case).

  2. The model contains many variables and thus many parameters to calibrate (five variables and six parameters to calibrate is usually not considered many, but compared to only 20 observations, the number of parameters can be considered large).

  3. The model is highly non-linear (a polynomial of degree 5 is highly non-linear).

To create the Polynomial Model with tidymodels we use the same model design as in the interactive Section 5.5. To introduce non-linearity we use the recipe (the data):

ModelDesignLinRegr=linear_reg() |> 
                   set_engine("lm") |> 
                   set_mode("regression")

RecipeHouses=recipe(Price~., data=DataTrain) |> 
             step_mutate(Sqft2=Sqft^2, Sqft3=Sqft^3, 
                         Sqft4=Sqft^4,Sqft5=Sqft^5)

The recipe utilizes step_mutate() to create four additional variables that are calculated as the second, third, fourth, and fifth powers of the variable \(Sqft\):

  • Sqft2=\(Sqft^2\),
  • Sqft3=\(Sqft^3\),
  • Sqft4=\(Sqft^4\),
  • Sqft5=\(Sqft^5\).

Because \(Sqft\) is not only used in its original form, but also with various powers (squared, cubic, quartic, and quintic), the model is non-linear (in the data).

When looking at Equation (6.1) you can see why the model is linear in its parameters: If we treat the variables \(Sqft, Sqft^2,\dots, Sqft^5\) the same as any other variable in a multivariate linear OLS model, then each variable is multiplied by one of the parameters \(\beta_1\) to \(\beta_5\) and added to the equation. Thus Equation (6.1) is a linear function as long as we interpret the different powers of \(Sqft\) as separate variables.

A Polynomial Model is linear in parameters but non-linear in data.

Consequently, the Polynomial Model from Equation (6.1) can still be optimized the same way as a regular OLS model because it is still linear in its parameters when we treat each power of \(Sqft\) as a separate variable.
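To make the "linear in parameters" argument concrete, here is a minimal base R sketch on simulated data (the data and the names X, BetaManual, and BetaLm are assumptions for illustration). It places each power of the predictor in its own column of a design matrix, solves the standard OLS normal equations \((X'X)\beta = X'y\), and compares the result with what lm() returns for the same polynomial.

```r
set.seed(7)
x = runif(30, 0, 3)                                    # hypothetical predictor
y = 1 + 0.5 * x - 2 * x^2 + 0.4 * x^3 + rnorm(30, sd = 0.2)

# Treat each power of x as its own column (plus a column of ones for the intercept)
X = cbind(Intercept = 1, x = x, x2 = x^2, x3 = x^3)

# OLS closed form: solve (X'X) beta = X'y
BetaManual = solve(crossprod(X), crossprod(X, y))

# The same polynomial estimated with lm()
BetaLm = coef(lm(y ~ x + I(x^2) + I(x^3)))

print(cbind(Manual = as.numeric(BetaManual), Lm = as.numeric(BetaLm)))
```

Both columns should agree up to numerical precision, which is the sense in which a polynomial regression is still an ordinary linear regression on transformed predictors.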

We optimize by adding the recipe and the linear model design to the workflow WFModelHouses, which is then fitted to the 20 observations in the training dataset with fit(DataTrain):

WFModelHouses=workflow() |> 
              add_recipe(RecipeHouses) |>
              add_model(ModelDesignLinRegr) |> 
              fit(DataTrain)

Since the workflow WFModelHouses is fitted to the training data, we can use it to predict and measure the model’s predictive performance. We start with measuring the performance regarding the training data.

As before, we use the augment() command to append the predictions to the training dataset and then we use the metrics() command to calculate predictive performance metrics (see Sections 4.7.6 and 4.7.7 for details):

DataTrainWithPred=augment(WFModelHouses, DataTrain)
metrics(DataTrainWithPred, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   136432.   
## 2 rsq     standard        0.715
## 3 mae     standard   104047.

The metrics() command by default calculates the root mean squared error (rmse), \(r^2\) (rsq), and the mean absolute error (mae), when provided with the column names for the estimate (.pred) and the truth (Price) in the data frame DataTrainWithPred.
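The three metrics can also be computed by hand. The sketch below uses four made-up observations (Truth and Estimate are hypothetical values, not taken from the housing data) and assumes that rsq is the squared correlation between truth and estimate, which is how the yardstick documentation describes it.

```r
Truth    = c(100, 200, 300, 400)   # hypothetical observed prices
Estimate = c(110, 190, 330, 380)   # hypothetical predictions

Errors = Estimate - Truth          # 10, -10, 30, -20

Rmse = sqrt(mean(Errors^2))        # root mean squared error: sqrt(375)
Mae  = mean(abs(Errors))           # mean absolute error: 17.5
Rsq  = cor(Truth, Estimate)^2      # squared correlation (assumed rsq definition)

print(c(rmse = Rmse, mae = Mae, rsq = Rsq))
```

Because the rmse squares the errors before averaging, the single large miss of 30 pulls it above the mae. This is why the two metrics can diverge sharply when a model produces a few extreme predictions.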

You can see that based on the training data, the Polynomial Model performs well. For example, based on the mae, the model under/overestimates on average by $104,000 (for comparison, a linear model with \(Sqft\) as the only predictor variable would create a mean absolute error of $139,000).

The good results for the Polynomial Model based on the training data are also confirmed in Figure 6.1 where we plotted the 20 training data observations (red) together with a linear prediction function (blue) and the prediction function for the Polynomial Model (magenta).

This figure shows a graph with a blue trendline and magenta polynomial curve.

FIGURE 6.1: Approximating Training Data with Polynomial (Degree 5)

You can see that the magenta line (the Polynomial Model) approximates the training data better than the blue line (the linear model). This is because the non-linearity of the Polynomial Model gives the magenta line more flexibility. The magenta line bends downwards to better predict lower-priced houses between 1,300 sqft and 2,200 sqft. Then it bends upwards to approximate higher priced houses with square footage between 2,500 sqft and 4,000 sqft. Finally, it bends down again to almost perfectly approximate the low-priced outlier house point with about 4,500 sqft.

In contrast, the blue line representing the linear benchmark cannot bend and therefore is located in the center of the training data.

The question remains whether the Polynomial Model (the magenta line in Figure 6.1) can also predict the testing data well.

To generate predictive metrics for the testing data, we again use the augment() and the metrics() commands, but this time we provide the data frame DataTest to the augment() command:

DataTestWithPred=augment(WFModelHouses, DataTest)
metrics(DataTestWithPred, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   99940240.    
## 2 rsq     standard          0.0215
## 3 mae     standard    1719470.

You can see that the Polynomial Model that performed so well on the training data performs poorly on the testing data. The mean absolute error (mae) shows that the model under/overestimates on average by $1,719,000 (!!!) based on the testing data (for comparison, a linear model with \(Sqft\) as the only predictor variable would create a mean absolute error of $168,000 based on the testing data).

Why overfitting occurred and why the testing data were predicted so poorly by the Polynomial Model can be seen in Figure 6.2. The figure shows the training data (red dots), the testing data (small black dots) together with the prediction functions for the Polynomial Model (magenta line), and the linear benchmark (blue line).

This figure shows a graph with a cluster of data along with the two trend lines.

FIGURE 6.2: Model Performance on Testing and Training Data

The flexibility of the magenta line (the non-linearity of the Polynomial Model) allows it to approximate the training data very well, but the strong focus on the training data fails to represent the general trend of the remaining data (testing data). This is exactly the problem of overfitting.

In contrast, the blue linear line cannot bend (the linear one-predictor benchmark model can only produce a straight line). This prevents the model from overfitting (over-approximating the training data).

The problems with overfitting can further be demonstrated when extending the Degree-5 Polynomial Model to a Degree-10 Polynomial Model. This means that in the prediction equation, exponents all the way up to 10 are considered for \(Sqft\):

\[\widehat{Price}=\beta_1 Sqft+\beta_2 Sqft^2+\beta_3 Sqft^3 +\beta_4 Sqft^4 + \cdots +\beta_{9} Sqft^{9}+\beta_{10} Sqft^{10}+\beta_{11}\]

Thanks to the tidymodels package, it is fairly easy to repeat the analysis for a Degree-10 Polynomial Model by only modifying the recipe. Instead of using step_mutate() to generate new predictor variables with various powers of \(Sqft\), we now use step_poly() to generate 10 new variables for the 10 different powers of \(Sqft\):

RecipeHousesPoly10=recipe(Price~., data=DataTrain) |> 
                   step_poly(Sqft, degree=10, 
                             options=list(raw=TRUE))

By default (options=list(raw=FALSE)), the command step_poly() transforms the calculated polynomials to orthogonal polynomials, which allows for better interpretation of the regression parameters (\(\beta s\)). Since the Polynomial Regression model we use here is too complex to allow any interpretation of its parameters and since orthogonal polynomials exceed the scope of this book, we will work with the original polynomials in what follows (options=list(raw=TRUE)). The predictions are the same regardless of whether you use original or orthogonal polynomials.
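The claim that original and orthogonal polynomials lead to the same predictions can be checked with base R's poly() function. This is a sketch on simulated data with a low degree (the data and the names FitRaw, FitOrth, and NewData are assumptions for illustration); it does not call step_poly() itself.

```r
set.seed(3)
x = runif(40, 0, 10)                         # hypothetical predictor
y = 5 + 2 * x - 0.3 * x^2 + rnorm(40)

FitRaw  = lm(y ~ poly(x, 3, raw = TRUE))     # original powers x, x^2, x^3
FitOrth = lm(y ~ poly(x, 3))                 # orthogonal polynomials (default)

NewData  = data.frame(x = c(1, 5, 9))
PredRaw  = unname(predict(FitRaw, NewData))
PredOrth = unname(predict(FitOrth, NewData))

# The coefficients differ, but the predictions agree up to numerical precision
print(rbind(Raw = coef(FitRaw), Orthogonal = coef(FitOrth)))
print(max(abs(PredRaw - PredOrth)))
```

For high degrees, the raw powers of a large-valued predictor can become numerically ill-conditioned, so small differences between the two variants may appear in practice even though they are mathematically equivalent.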

If we graph the resulting prediction function of the Degree-10 Polynomial Model (magenta line in Figure 6.3), you can see the complete disaster of extreme overfitting.

This figure shows a graph with a cluster of data points and a predicted magenta curve that doesn't fit the data very accurately.

FIGURE 6.3: Model Performance on Testing and Training Data

The Degree-10 Polynomial Model represented by the magenta line approximates the 20 training observations (red dots) almost perfectly but totally fails to predict the testing data (small black dots).

For example, the Degree-10 Polynomial Model predicts extremely high housing prices for houses with small square footage, predicts negative housing prices for houses with square footage in the ranges of 1,000 – 1,200 sqft and 2,700 – 3,700 sqft, and predicts extremely high prices for houses with 3,900 – 4,400 sqft.

From Figure 6.3, it is easy to see why overfitting occurred. However, in almost all real-world cases we will not be able to generate a graph like the one in Figure 6.3. Therefore, comparing performance between training and testing data is the only way to identify overfitting.

Overfitting Summary

Overfitting occurs when a highly non-linear (flexible) model adjusts almost perfectly to a relatively small number of training observations, but the adjustment is so specific that the model fails when predicting the testing data.

Overfitting is sometimes compared to human learning behavior, when somebody learns facts by heart rather than understanding the underlying theory.

If — after the completion of model development — a comparison of training and testing data reveals overfitting, the model development needs to start from scratch.

The new model development includes generating new sets of training and testing data (change of the value in set.seed()). Otherwise, we risk extending the overfitting from the training to the testing data.

6.5 Tuning the Complexity of a Polynomial Model

In the previous section, overfitting occurred in the Degree-5 Polynomial Model and to an extreme degree in the Degree-10 Polynomial Model. The linear model (blue line in Figures 6.2 and 6.3) seemed to be superior to the Degree-5 and Degree-10 Polynomial Model.

However, this does not mean a linear model is always the best choice. For example, when the underlying pattern that generated the data is non-linear, a linear regression model is poorly suited to fit the non-linear pattern.

This raises the question:

How can we choose the right degree for a Polynomial Model?

6.5.1 Hyper-Parameters vs. Model Parameters

The appropriate degree for a Polynomial Model cannot be found with the Optimizer that calibrates the \(\beta\)-parameters (model parameters) because in order to run the Optimizer, we have to choose the model first, including the polynomial degree. Therefore, parameters such as the polynomial degree, which need to be determined before the model calibration, are called hyper-parameters rather than model parameters.

Hyper-Parameters

  1. Hyper-parameters are parameters that change the design of a machine learning model or the data pre-processing in a recipe. For example, the hyper-parameter \(k\) in a k-Nearest Neighbors model design or the degree of a Polynomial Model in a recipe.

  2. Hyper-parameters often control the degree of non-linearity, and thus the flexibility of the prediction function. Therefore, we often face a trade-off between model flexibility and the risk of overfitting.

  3. Hyper-parameters cannot be determined with an Optimizer based on training data like a model’s \(\beta\) parameters. Therefore, we must utilize a trial-and-error process to find suitable hyper-parameters. This process is called hyper-parameter tuning.

  4. We cannot utilize the testing data for hyper-parameter tuning because this can extend overfitting into the testing data!

Finding the best hyper-parameter values is not specific to Polynomial Models only; we ran into this challenge already in Chapter 4 when deciding on the parameter \(k\) (the number of neighbors to consider) for k-Nearest Neighbors. Later, in Chapter 9, when covering Neural Network models, we have to decide how many neurons to consider when building a Neural Network. The number of neurons in a Neural Network is also considered a hyper-parameter. In fact, most machine learning models require finding the best hyper-parameter values at the model design stage. This process is called hyper-parameter tuning.

6.5.2 Creating the Tuning Workflow

In this and the following sections, we will tune the hyper-parameter for the polynomial degree of the model covered in Section 6.4 using the same dataset. The only exception is the number of observations assigned to the training data. Instead of using an extremely small training dataset, we will use a more realistic training dataset size. In the code block below, we again import the King County House Sale dataset (Kaggle (2015)), but now split the 21,613 observations into 80% training and 20% testing data:

library(tidymodels); library(rio); library(janitor)

DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving)

set.seed(987)

Split80=DataHousing |> 
        initial_split(prop=0.8, strata=Price, breaks=5) 
DataTrain=training(Split80)
DataTest=testing(Split80) 

Building the workflow for the analysis follows almost the same steps as in the previous section. We create a recipe and a model design. Afterward, we add both to a workflow:

RecipeHousesPolynomOLS=recipe(Price~., data=DataTrain) |> 
                       step_poly(Sqft, degree=tune(), 
                                 options=list(raw=TRUE))

ModelDesignLinRegr=linear_reg() |> 
                   set_engine("lm") |> 
                   set_mode("regression")

TuneWFModelHouses=workflow() |> 
                  add_model(ModelDesignLinRegr) |> 
                  add_recipe(RecipeHousesPolynomOLS)

However, there are two differences compared to the workflow in Section 6.4:

  1. In step_poly() where the argument degree= determines the highest power of \(Sqft\) in the prediction equation, we do not assign a number for the degree. This makes sense because the aim of hyper-parameter tuning is to find this number (the degree of the Polynomial Model).

    Since the argument degree= needs somehow to be determined, we use tune() as a placeholder. It is important not to over-interpret the meaning of tune(). It is only a placeholder that, later, will get replaced by the values for degree when we try and evaluate different values for degree.

  2. The workflow does not contain a fit() command to fit the model parameters to the training data. This also makes sense: Because the degree for the polynomial function is not determined, fitting the model is not possible. Consequently, the workflow cannot be used for predictions. It is only a blueprint for the tuning process that we will later perform. To clarify that the workflow is used for tuning only, the related R object name is prefixed with the word Tune (TuneWFModelHouses).

In order to evaluate the performance of several different degrees for the Polynomial Model, we have to decide which degrees we would like to try out. This is an arbitrary decision.

Nevertheless, the tidymodels package can still provide some guidance. For most hyper-parameters a related command exists that returns a recommended hyper-parameter value range. The name of the command is often the same as the name of the hyper-parameter. For example, you can use the command degree() to find a recommended range for the hyper-parameter degree:

degree()
## Polynomial Degree (quantitative)
## Range: [1, 3]

The command returns a recommended range for the hyper-parameter degree from \(1\) to \(3\). We will extend this range and evaluate polynomial degrees from \(1\) to \(10\).

For tuning purposes, the tidymodels package expects a data frame with the values for each hyper-parameter in the columns. The column names must be the same as the name of the respective hyper-parameter. Since we tune only one hyper-parameter in this case, the data frame ParGridHouses contains only one column named degree with the values from \(1\) to \(10\):

ParGridHouses=data.frame(degree=c(1:10))
print(ParGridHouses)
##    degree
## 1       1
## 2       2
## 3       3
## 4       4
## 5       5
## 6       6
## 7       7
## 8       8
## 9       9
## 10     10

Later, during the tuning process, the hyper-parameter values above will be pushed one by one to the tuning workflow TuneWFModelHouses. Each value will fill in for the placeholder tune() and the workflow will be fitted. Next, its predictive quality will be evaluated. The workflow (the polynomial degree) with the best performance constitutes the best model.

6.5.3 Validating the Tuning Results

To find the best model, we need to validate each hyper-parameter value from the tuning results (each hyper-parameter combination in case we tune more than one hyper-parameter). This raises the question:

Which dataset should be used to validate the tuning results?

You might be tempted to use the testing dataset to validate different values for hyper-parameters. However, keep in mind that the testing dataset should never be used for any type of optimization — including hyper-parameter tuning.

If you ignore this rule, you might get a good performance on the testing data. But because now the optimization is specialized for the testing data, you likely will get poor predictive results in the production stage when you confront your model with new data, which it had never seen before. This would be an example of pushing the overfitting problem from the training to the testing data.

Using the complete training dataset to find the best hyper-parameter value is also not an option because the best performing hyper-parameter value would be the one that triggers the highest degree of overfitting.

In what follows, we will introduce two strategies to assess various values of hyper-parameters without using the testing dataset or the complete training dataset:

Validation Dataset

One option is to randomly choose a number of observations from the training dataset, exclude them from training, and assign them to an additional holdout dataset called the validation dataset.

The validation dataset will never be used for training. Instead, the observations in the validation dataset are set aside to assess the predictive performance for different hyper-parameter values.

A validation dataset is very similar to a testing dataset since both are used to assess the performance of a specific model. The difference is that the assessment is performed at different stages of development. While the validation dataset is used during the model design stage to find the best hyper-parameter values, the testing dataset is used after the development of the machine learning model is finalized to assess overall predictive quality.

The validation_split() command in the code block below can be used to split the training data into observations that are used for training (analysis observations) and those that are used to assess the predictive performance for various hyper-parameter settings (assessment observations). The command validation_split() is similar to the command initial_split(). The argument prop=0.85 determines the proportion of observations left for training and the argument strata=Price ensures that different housing price levels are proportionally distributed between training and assessment. The split observations are then stored in the data frame DataValidate:

set.seed(879)
DataValidate=validation_split(DataTrain, prop = 0.85, strata=Price)

The resulting data frame DataValidate includes the complete training dataset, but observations are internally earmarked with analysis to indicate that an observation will be used for training, and with assessment to indicate that the observation will be used as validation data to assess hyper-parameter performance.

Using a validation dataset to compare hyper-parameter performance is appropriate for large datasets. For smaller datasets, there are a couple of disadvantages:

  1. Excluding observations from the training process and earmarking them for hyper-parameter assessment reduces the number of observations that are available for training.

  2. The observations used for the assessment of hyper-parameters are randomly chosen. This bears the risk that, by accident, an unusual validation dataset might be created (the risk is higher for smaller training datasets). Evaluating hyper-parameters based on unusual assessment observations might lead to a sub-par choice of hyper-parameter values.
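The logic behind a validation dataset can be sketched in base R without any tidymodels commands. The example below is an illustration under stated assumptions: the data are simulated with a non-linear pattern, and the object names (DataSim, DataAnalysis, DataAssessment, RmseAssessment, BestDegree) are hypothetical. It is not the procedure that validation_split() runs internally.

```r
set.seed(11)
n = 400
DataSim = data.frame(x = runif(n, 0, 3))
DataSim$y = sin(2 * DataSim$x) + rnorm(n, sd = 0.3)   # hypothetical non-linear pattern

# Earmark 85% for analysis (training) and 15% for assessment (validation)
IdxAnalysis    = sample(n, size = floor(0.85 * n))
DataAnalysis   = DataSim[IdxAnalysis, ]
DataAssessment = DataSim[-IdxAnalysis, ]

Rmse = function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Try each candidate degree: fit on analysis data, score on assessment data
Degrees = 1:6
RmseAssessment = sapply(Degrees, function(d) {
  Fit = lm(y ~ poly(x, d), data = DataAnalysis)
  Rmse(DataAssessment$y, predict(Fit, DataAssessment))
})

BestDegree = Degrees[which.min(RmseAssessment)]
print(data.frame(Degree = Degrees, RmseAssessment))
print(BestDegree)
```

Changing the value in set.seed() selects a different set of assessment observations and can change the chosen degree, which illustrates the second disadvantage listed above.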

Cross-Validation

Instead of using one dataset where observations are earmarked for training or hyper-parameter assessment, the Cross-Validation procedure creates multiple training/assessment datasets called folds or resamples. These folds differ only by which observations are chosen for training and which ones are used for hyper-parameter assessment. Figure 6.4 shows the basic idea behind Cross-Validation for four folds.

This figure shows a diagram of the basic idea behind Cross-Validation.

FIGURE 6.4: The Basic Idea Behind Cross-Validation

Cross-Validation shuffles the training dataset and then copies it \(N\) times, assigning each copy to one of \(N\) folds. Each of these folds has a different set of observations excluded from the training and used for the assessment of the various hyper-parameter combinations.

Figure 6.4 shows an example for four folds. The shuffled training dataset is copied four times into Folds 1 – 4. In Fold 1, the last quarter of observations is assigned to the assessment dataset. The remaining observations are used for training. In Fold 2, the third quarter of observations is assigned to the assessment dataset, and the remaining observations are used for training. In Fold 3, the second quarter of observations is assigned to the assessment dataset, and in Fold 4, the first quarter. This ensures that every observation is used exactly once in an assessment dataset.

When a model is tuned, each of the hyper-parameter values is assessed for all four folds (requires training of the model for each fold). The overall performance for a hyper-parameter value is calculated as the mean performance of the assessment observations in folds 1, 2, 3, and 4. The same process is then repeated for the other hyper-parameter values.
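
The averaging over folds can be illustrated with a minimal base-R sketch. It does not use the book's housing data: the data frame DataSim, its simulated values, and all object names are assumptions made only for this illustration, and the loop mimics the idea of Cross-Validation rather than the internals of tidymodels.

```r
# Simulated stand-in for the training data (hypothetical values)
set.seed(987)
N=200
DataSim=data.frame(Sqft=runif(N, 500, 5000))
DataSim$Price=100000+150*DataSim$Sqft+rnorm(N, sd=50000)

# Shuffle the rows; then rows 151-200 (last quarter) form the
# assessment data of Fold 1, rows 101-150 of Fold 2, and so on
DataSim=DataSim[sample(N), ]
FoldId=rep(4:1, each=N/4)

Degrees=1:3
MeanRmse=sapply(Degrees, function(d) {
  RmsePerFold=sapply(1:4, function(f) {
    Train=DataSim[FoldId!=f, ]    # three quarters for training
    Assess=DataSim[FoldId==f, ]   # one quarter for assessment
    Fit=lm(Price~poly(Sqft, d), data=Train)
    Pred=predict(Fit, newdata=Assess)
    sqrt(mean((Assess$Price-Pred)^2))
  })
  mean(RmsePerFold)   # overall performance: mean over the four folds
})
names(MeanRmse)=paste0("degree_", Degrees)
print(MeanRmse)
```

Each entry of MeanRmse is the mean of four fold-specific rmse values, which is the quantity used to compare the hyper-parameter values.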

It is common to choose ten folds if the training dataset is sufficiently big. For smaller datasets, a lower number of folds can be selected. To compensate for a low number of folds, the process of shuffling the training data, creating folds, and training/assessing the models can be repeated several times. This requires setting the repeats= argument of the vfold_cv() command to a value \(>1\) (the default is repeats=1).
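
As a small sketch of repeated Cross-Validation, the code below creates five folds and repeats the procedure three times. The data frame DataSmall is simulated and purely hypothetical; the argument name repeats= is the one used by vfold_cv() in the rsample package (part of tidymodels).

```r
library(rsample)   # loaded automatically with library(tidymodels)

# Small simulated dataset (hypothetical values)
set.seed(987)
DataSmall=data.frame(X=rnorm(60), Y=rnorm(60))

# Five folds, repeated three times with a fresh shuffle each time
FoldsRepeated=vfold_cv(DataSmall, v=5, repeats=3)
print(FoldsRepeated)
nrow(FoldsRepeated)   # one row per fold and repeat
```

The resulting object contains one resample for every fold of every repeat, so tuning with it fits three times as many models as a single 5-fold setup.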

The advantage of Cross-Validation is that different sets of observations are used for assessment (the mean prediction error across the folds is used to assess overall performance) and that every observation of the training data is used for assessment in exactly one fold. Therefore, the risk of an unusual assessment dataset is mitigated.

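The disadvantage of Cross-Validation is that each hyper-parameter setup needs to be trained and assessed separately for each of the folds. Computation time increases proportionally with the number of hyper-parameter combinations tried out and with the number of folds used. Because the number of combinations multiplies with every additional hyper-parameter, it grows exponentially with the number of hyper-parameters tuned.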

Using the R code block below, you can create the Cross-Validation folds for our Polynomial Model:

set.seed(987)
FoldsHouses=vfold_cv(DataTrain, v=4, strata=Price)

For simplicity, we generate only four folds, although our dataset would be big enough to choose the common 10-fold setup. The command vfold_cv() creates the four folds (see the argument v=4). The strata argument ensures that the different house prices are proportionally represented in the various assessment folds.

6.5.4 Executing the Tuning Process

Now that we have stored the four folds for training and hyper-parameter assessment in the R object FoldsHouses and the hyper-parameters to be tried out in ParGridHouses, we can use the command tune_grid() to evaluate each of the ten parameter values (degree 1 – 10). Given four folds and ten parameter values to evaluate, the tune_grid() command has to train and assess a total of 40 model/data variations.

The command tune_grid() executes the tuning process (see the R code block below). It requires the name of the tuning workflow (TuneWFModelHouses) as the first argument. Then the data frame with the hyper-parameter values to be tried out must be provided with the grid= argument (in our case: grid=ParGridHouses). Finally, an argument for the R object that holds the folds for training and assessment (resamples=FoldsHouses) is required. The metrics argument is optional. In the R code below, the argument metrics=metric_set(rmse,rsq,mae) determines that performance metrics for the root mean squared error (rmse), \(r^2\) (rsq), and the mean absolute error (mae) are calculated for each parameter value and for each fold:

TuneResultsHouses=tune_grid(TuneWFModelHouses, resamples=FoldsHouses, 
                            grid=ParGridHouses, 
                            metrics=metric_set(rmse,rsq,mae))

Tuning a workflow can take a while, from a few seconds to a day or more depending on the number of hyper-parameters to tune and the number of folds used for validation. In the code block above, the results from tune_grid() for each parameter value, each fold, and each performance metric are saved in the R object TuneResultsHouses.

There are several ways to extract information from the R object TuneResultsHouses. For example, you can use the command autoplot() to create a graphical overview of the performance for the different hyper-parameter values (see Figure 6.5).

This figure shows three graphs that represent mae, rmse, and rsq during tuning.

FIGURE 6.5: Performance Metrics During Tuning

The plots in Figure 6.5 indicate that a linear equation (degree=1) does not perform as well as some of the polynomials with a degree>1.

The performance for degree 2 – 8 is similar for the three metrics (with the exception of degree=7 for the rsq metric).

Polynomials with degrees 9 and 10 have a poor predictive performance based on the assessment from the Cross-Validation folds.
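
The averaged values behind such plots can also be listed as a table with collect_metrics(). The sketch below is self-contained and therefore runs a small tuning on simulated data; DataSim and all object names ending in Sim are assumptions made for this illustration and are not the book's housing data or objects.

```r
library(tidymodels)

# Simulated data (hypothetical values, not the housing dataset)
set.seed(987)
DataSim=data.frame(Sqft=runif(300, 500, 5000))
DataSim$Price=100000+150*DataSim$Sqft+0.02*DataSim$Sqft^2+
              rnorm(300, sd=50000)

# Tuning workflow with the polynomial degree as hyper-parameter
RecipeSim=recipe(Price~Sqft, data=DataSim) |>
          step_poly(Sqft, degree=tune())
ModelSim=linear_reg() |>
         set_engine("lm") |>
         set_mode("regression")
TuneWFSim=workflow() |>
          add_recipe(RecipeSim) |>
          add_model(ModelSim)

# Four folds and three degrees to try out
FoldsSim=vfold_cv(DataSim, v=4, strata=Price)
ParGridSim=data.frame(degree=1:3)
TuneResultsSim=tune_grid(TuneWFSim, resamples=FoldsSim,
                         grid=ParGridSim,
                         metrics=metric_set(rmse,rsq,mae))

# One row per degree and metric, averaged over the four folds
MetricsSim=collect_metrics(TuneResultsSim)
print(MetricsSim)
```

The column mean holds the average over the folds, n the number of folds, and std_err the standard error of that average.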

To see more details, we extract the hyper-parameter value rankings from the workflow TuneResultsHouses based on the three performance measures. We start with the best five hyper-parameters using the metric root mean squared error (rmse):

show_best(TuneResultsHouses, metric="rmse")
## # A tibble: 5 × 7
##   degree .metric .estimator    mean     n std_err
##    <int> <chr>   <chr>        <dbl> <int>   <dbl>
## 1      6 rmse    standard   251993.     4   6179.
## 2      8 rmse    standard   252979.     4   5971.
## 3      2 rmse    standard   255965.     4   7243.
## 4      3 rmse    standard   257680.     4   7875.
## 5      4 rmse    standard   260994.     4   9911.
## # ℹ 1 more variable: .config <chr>

The best (lowest) rmse was 251,993 for a polynomial degree of 6. The linear model (degree=1) did not make it to the top five.

The ranking of the five best-performing models based on \(r^2\) is the same as the one for rmse:

show_best(TuneResultsHouses, metric="rsq")
## # A tibble: 5 × 7
##   degree .metric .estimator  mean     n std_err .config
##    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
## 1      6 rsq     standard   0.522     4  0.0304 Prepro…
## 2      8 rsq     standard   0.521     4  0.0110 Prepro…
## 3      2 rsq     standard   0.513     4  0.0277 Prepro…
## 4      3 rsq     standard   0.508     4  0.0226 Prepro…
## 5      4 rsq     standard   0.498     4  0.0295 Prepro…

The ranking for the mean absolute error (mae) is slightly different. However, the best-performing model is still the Degree-6 Polynomial Model:

show_best(TuneResultsHouses, metric="mae")
## # A tibble: 5 × 7
##   degree .metric .estimator    mean     n std_err
##    <int> <chr>   <chr>        <dbl> <int>   <dbl>
## 1      6 mae     standard   165798.     4   1924.
## 2      8 mae     standard   165868.     4   1764.
## 3      4 mae     standard   166325.     4   2137.
## 4      3 mae     standard   166434.     4   1863.
## 5      2 mae     standard   166546.     4   1840.
## # ℹ 1 more variable: .config <chr>

Since all three performance measures ranked a Polynomial Model of degree 6 as the best model, it does not matter which performance measure we choose to extract the degree for the best model. For example, to extract the best-performing hyper-parameter to minimize rmse, we can use:

BestHyperPar=select_best(TuneResultsHouses, metric="rmse")
print(BestHyperPar)
## # A tibble: 1 × 2
##   degree .config              
##    <int> <chr>                
## 1      6 Preprocessor06_Model1

The printout of BestHyperPar shows that the best-performing hyper-parameter value is saved in the data frame column degree as the first and only entry.

We will use this data frame to create a model (the best one) with the best hyper-parameter (degree=6).

To do this, we add the best hyper-parameter to the tune workflow with finalize_workflow(). This command substitutes the tune() placeholder with the optimal hyper-parameter value that is stored in BestHyperPar. Afterward, we add the fit() command to the pipe to train the workflow model:

BestWFModelHouses=TuneWFModelHouses |> 
                  finalize_workflow(BestHyperPar) |> 
                  fit(DataTrain)
print(BestWFModelHouses)
## ══ Workflow [trained] ═══════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ─────────────────────
## 1 Recipe Step
## 
## • step_poly()
## 
## ── Model ────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)  Sqft_poly_1  Sqft_poly_2  Sqft_poly_3  
##   -1.30e+04     6.57e+02    -4.90e-01     2.05e-04  
## Sqft_poly_4  Sqft_poly_5  Sqft_poly_6  
##   -3.76e-08     3.18e-12    -9.93e-17

The printout above from the fitted workflow confirms that BestWFModelHouses is a fitted workflow because it contains the values for the estimated model parameters (see Coefficients). Consequently, BestWFModelHouses can be used for predictions. We use the augment() command to predict based on the testing dataset and the metrics() command to calculate the related metrics:

DataTestWithPredBestModel=augment(BestWFModelHouses, DataTest)
metrics(DataTestWithPredBestModel, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   240706.   
## 2 rsq     standard        0.586
## 3 mae     standard   164987.

This figure shows a graph with a cluster of points, a magenta line, and a blue line.

FIGURE 6.6: Polynomial Degree-6 vs. Linear Prediction Functions

Given that we used a univariate model with \(Sqft\) being the only predictor variable, the results look quite good. Based on the testing data, \(r^2=0.5857\). The mean absolute error shows that the housing price is, on average, under/overestimated by $164,987 (for comparison, a linear model with \(Sqft\) as the only predictor variable would create a mean absolute error of $173,000 based on the testing data).
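
To make the three metrics concrete, the following base-R sketch calculates them by hand for four made-up houses. The values in Truth and Pred are hypothetical (prices in $1,000), and the calculation of rsq as the squared correlation between predictions and true values reflects how the yardstick package defines its rsq metric.

```r
# Hypothetical true prices and predictions (in $1,000)
Truth=c(200, 300, 400, 500)
Pred=c(210, 280, 430, 480)

Errors=Truth-Pred            # -10, 20, -30, 20
Mae=mean(abs(Errors))        # mean absolute error: 20
Rmse=sqrt(mean(Errors^2))    # root mean squared error: sqrt(450)
Rsq=cor(Truth, Pred)^2       # squared correlation

print(c(mae=Mae, rmse=Rmse, rsq=Rsq))
```

Because the errors are squared before averaging, rmse weighs large errors more heavily than mae, which is why rmse (about 21.2) exceeds mae (20) in this example.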

Figure 6.6 illustrates why the Degree-6 Polynomial Model (magenta line) performs better than a linear regression (blue line) and why it does not lead to overfitting: Although a polynomial function of Degree-6 is potentially very flexible, you can see in Figure 6.6 that it differs only slightly from the linear prediction function for houses smaller than 3,000 sqft. For houses larger than 3,000 sqft, the Degree-6 prediction function estimates higher valued houses much better than the linear function.

6.6 10-Step Template to Tune with tidymodels

In the previous section, we used the tidymodels package to tune a polynomial machine learning model. Since tidymodels provides a unified analysis framework independent of the machine learning model, you can use the same set of commands for many other machine learning models.

In this section, we provide a 10-Step Template to make it easy for you to develop a complete machine learning analysis that includes tuning hyper-parameters and also assessing the final results based on the testing data. You can use the template for all machine learning models covered in this book and for other machine learning models not covered here as well.

Below you will find the 10-Step Template together with sample code. In the Digital Resources section for this chapter (see Section 6.10), you will find an R script that contains the R code for the 10-Step Template together with an example.

Step 1 - Generate Training and Testing Data:

Note, it is assumed that the data frame MyData contains the data you are analyzing.

set.seed(987)
Split80=MyData |> 
    initial_split(prop=0.8, strata=<OUTCOME VARIABLE>, breaks=5) 
DataTrain=training(Split80)
DataTest=testing(Split80)

Substitute <OUTCOME VARIABLE> with the name of your outcome variable for the strata= argument. This ensures that the distribution of your outcome variable is similar in the training and testing data.
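
The effect of the strata= argument can be checked with a small sketch. The data frame MyDataSim and its column names are hypothetical and simulated only for this illustration.

```r
library(rsample)   # loaded automatically with library(tidymodels)

# Simulated data with a skewed outcome (hypothetical values)
set.seed(987)
MyDataSim=data.frame(Outcome=rlnorm(1000, meanlog=12, sdlog=0.5),
                     X=rnorm(1000))

SplitSim=MyDataSim |>
    initial_split(prop=0.8, strata=Outcome, breaks=5)
TrainSim=training(SplitSim)
TestSim=testing(SplitSim)

# With stratification, the two summaries should look similar
summary(TrainSim$Outcome)
summary(TestSim$Outcome)
```

Every observation ends up in either the training or the testing data, and the quartiles reported by summary() should be close to each other for the two datasets.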

Step 2 - Create a Recipe:

In the code block below, substitute <OUTCOME VARIABLE> with your outcome variable and <PREDICTOR VARIABLE(S)> with a list of your predictor variables separated by “+”-signs.

Alternatively, you can use a “.” on the right of the “~”-sign to use all predictor variables from the related data frame.

Recipe=recipe(<OUTCOME VARIABLE>~<PREDICTOR VARIABLE(S)>,
              data=DataTrain) |>
       step_<NAME OF STEP>(<ARGUMENT(S) OF STEP>)

Note that step_<NAME OF STEP>() stands for an optional pre-processing step of the predictor variables. <ARGUMENT(S) OF STEP> represents optional arguments for the related step_ command.

If you plan to tune a hyper-parameter in a recipe you have to assign the tune() placeholder rather than a value to the related argument. For example, degree=tune() in step_poly().

You can find a list of available step commands together with their names at: https://recipes.tidymodels.org/reference.

Step 3 - Create a Model Design:

Substitute <NAME OF ML-COMMAND> with the command name for the related machine learning model and, optionally, <ARGUMENT(S) OF COMMAND> with arguments for that command.

Then substitute <PACKAGE NAME> in the set_engine() command with the package name for the machine learning model and <MODE> in the set_mode() command with either regression or classification.

ModelDesign=<NAME OF ML-COMMAND>(<ARGUMENT(S) OF COMMAND>) |>
            set_engine("<PACKAGE NAME>") |> 
            set_mode("<MODE>")

If you plan to tune a hyper-parameter in a model design, you have to assign the tune() placeholder rather than a value to the related argument. For example, neighbors=tune() in the nearest_neighbor() command.

You can find the command names for various machine learning models together with the related package names for the set_engine() command at: https://parsnip.tidymodels.org/reference.

Step 4 - Add the Recipe and the Model Design to a Workflow:

In the code block below the recipe (named: Recipe) and the model design (named: ModelDesign) are added to the workflow TuneWFModel:

TuneWFModel=workflow() |>
            add_recipe(Recipe) |>     
            add_model(ModelDesign)

Step 5 - Create a Hyper-Parameter Grid:

The hyper-parameter values that need to be tried out must be listed in a data frame column named with the same name as the hyper-parameter:

ParGrid=data.frame(<HYPER-PAR1>=c(<LIST OF VALUES>),
               <HYPER-PAR2>=c(<LIST OF VALUES>), <ETC>) 

The data.frame() command is one way to create the required data frame. <HYPER-PAR1> is the name of a hyper-parameter, and the c() command can be utilized to provide a list of values. If you tune more than one hyper-parameter, add these in the same way.
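
Note that data.frame() pairs the listed values row by row. If all combinations of two hyper-parameters are to be tried out, the base-R command expand.grid() is one way to generate them. The sketch below uses the arguments neighbors and weight_func of the nearest_neighbor() command as an example; the chosen values are only assumptions for illustration.

```r
# All combinations of three neighbors values and two weight functions
ParGridTwo=expand.grid(neighbors=c(1,3,6),
                       weight_func=c("rectangular","triangular"),
                       stringsAsFactors=FALSE)
print(ParGridTwo)
nrow(ParGridTwo)   # 3 x 2 = 6 combinations
```

Each row of ParGridTwo is one hyper-parameter combination that the tuning would evaluate for every fold.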

If you tune only one hyper-parameter, for example, only the number of neighbors in a k-Nearest Neighbors model, the code could look like this:

ParGrid=data.frame(neighbors=c(1,3,6))

Step 6 - Create Resamples for Cross-Validation:

To create the resamples for Cross-Validation, you can use the following code:

FoldsForTuning=vfold_cv(DataTrain, v=10, strata=<OUTCOME VARIABLE>)

A typical Cross-Validation setup includes ten folds. If you would like to work with a smaller number of folds, especially for smaller datasets, change v=10 accordingly to reflect the number of folds.

To ensure the outcome variable is similarly distributed in each section of the folds, substitute <OUTCOME VARIABLE> with the name of your outcome variable.

Step 7 - Tune the Workflow and Train All Models:

The command tune_grid() trains models for all hyper-parameter combinations stored in the data frame ParGrid using all resamples stored in the previous step in FoldsForTuning.

TuneResults=tune_grid(TuneWFModel, resamples=FoldsForTuning,
                      grid=ParGrid, 
                      metrics=metric_set(<LIST OF METRICS>),
                      control=control_grid(verbose=TRUE))

The optional metrics argument specifies the metrics that are calculated. For example, substitute metric_set(<LIST OF METRICS>) with metric_set(rmse, rsq, mae) for a regression or with metric_set(accuracy, sensitivity, specificity) for a classification problem.

The argument control=control_grid(verbose=TRUE) is optional. When used as shown here with verbose=TRUE, the tuning reports its progress to the R Console.

Step 8 - Extract the Best Hyper-Parameter(s):

Because all assessment results for the specified metrics are stored in the tuning object TuneResults, we can use select_best() to extract the best hyper-parameter(s) for the metric we are interested in:

BestHyperPar=select_best(TuneResults, metric="<METRIC>")

You need to specify which metric should be used to identify the best-performing model by substituting <METRIC> with the metric of your choice. Note that only metrics specified previously in Step 7 can be chosen.

Step 9 - Finalize and Train the Best Workflow Model:

You can use the command finalize_workflow() to substitute the tune() in TuneWFModel with the values from BestHyperPar. Afterward, the command fit(DataTrain) trains the finalized workflow model with the training data:

BestWFModel=TuneWFModel |>
            finalize_workflow(BestHyperPar) |>
            fit(DataTrain)

Step 10 - Assess Prediction Quality Based on the Testing Data:

This step should only be performed after the model is completed and no further changes are planned. If you changed the model after seeing the testing results, the testing data would no longer provide an unbiased assessment of the predictive quality.

DataTestWithPredBestModel=augment(BestWFModel, DataTest)
metrics(DataTestWithPredBestModel, truth=<OUTCOME VARIABLE>, 
    estimate=.pred)

The augment() command writes the predictions into a column named .pred and augments the testing data frame with that column. The resulting data frame is saved as DataTestWithPredBestModel.

Since the metrics() command needs to compare these predictions with the true values to calculate the metrics, the name of the outcome variable also needs to be provided by substituting <OUTCOME VARIABLE> accordingly.

6.7 🧭Project: Tuning a k-Nearest Neighbors Model

Interactive Section

In this section, you will find content together with R code to execute, change, and rerun in RStudio.

The best way to read and to work with this section is to open it with RStudio. Then you can interactively work on R code exercises and R projects within a web browser. This way you can apply what you have learned so far and extend your knowledge. You can also choose to continue reading either in the book or online, but you will not benefit from the interactive learning experience.

To work with this section in RStudio in an interactive environment, follow these steps:

  1. Ensure that both the learnr and the shiny package are installed. If not, install them from RStudio’s main menu (Tools -> Install Packages \(\dots\)).

  2. Download the Rmd file for the interactive session and save it in your project folder. You will find the link for the download below.

  3. Open the downloaded file in RStudio and click the Run Document button, located in the editing window’s top-middle area.

For detailed help on running the exercises, including videos for Windows and Mac users, see: https://blog.lange-analytics.com/2024/01/interactsessions.html

Do not skip this interactive section because besides providing applications of already covered concepts, it will also extend what you have learned so far.

Below is the link to download the interactive section:

https://ai.lange-analytics.com/exc/?file=06-TrainTestExerc100.Rmd

In Section 4.9, you used a k-Nearest-Neighbor model to predict the color of a wine. We arbitrarily set \(k=4\) to consider the four nearest neighbors.

In this section, you will work on the same problem with an interactive project, but you will tune the hyper-parameter \(k\) with Cross-Validation to find an optimal \(k\) (good approximation of the training data without overfitting). You will use the 10-Step Template from Section 6.6 to make it easy to set up the code.

In the Digital Resources section for this chapter (see Section 6.10), you will find a link to a blog post that describes how to use the 10-Step Template in detail. The blog post also provides the R code for the template.

Step 1 - Generating Training and Testing Data:

As before, you use the wine dataset and split the data into training (DataTrain) and testing data (DataTest). Since you use the same value in the set.seed() command, the (random) split will be identical to the one we used before with the \(k=4\) Nearest Neighbor model. The code below has been executed already.

library(tidymodels); library(rio); library(janitor)
DataWine=import("https://ai.lange-analytics.com/data/WineData.rds") |> 
         clean_names("upper_camel") |> 
         rename(Sulfur=TotalSulfurDioxide) |> 
         mutate(WineColor=as.factor(WineColor)) 

set.seed(876)
Split7030=initial_split(DataWine, prop=0.7, strata=WineColor)
DataTrain=training(Split7030)
DataTest=testing(Split7030)

head(DataTrain)
##   WineColor Acidity VolatileAcidity CitricAcid
## 1       red    10.8           0.320       0.44
## 2       red     6.7           0.855       0.02
## 3       red     7.5           0.380       0.57
## 4       red     7.1           0.270       0.60
## 5       red     8.0           0.580       0.28
## 6       red     7.6           0.400       0.29
##   ResidualSugar Chlorides FreeSulfurDioxide Sulfur
## 1           1.6     0.063                16     37
## 2           1.9     0.064                29     38
## 3           2.3     0.106                 5     12
## 4           2.1     0.074                17     25
## 5           3.2     0.066                21    114
## 6           1.9     0.078                29     66
##   Density   PH Sulphates Alcohol Quality
## 1  0.9985 3.22      0.78   10.00       6
## 2  0.9947 3.30      0.56   10.75       6
## 3  0.9960 3.36      0.55   11.40       6
## 4  0.9981 3.38      0.72   10.60       6
## 5  0.9973 3.22      0.54    9.40       6
## 6  0.9971 3.45      0.59    9.50       6

Step 2 - Create a Recipe:

Here, you will create a recipe and store it in the R object Recipe. Use step_rm() to remove the predictor variable \(Quality\) because it is not related to \(WineColor\), and the command step_normalize() to normalize all remaining predictors. Please, substitute <THESE> placeholders accordingly and execute the code. Note that the data frame DataTrain has already been loaded in the background:

Recipe=recipe(WineColor~., data=DataTrain) |>
       step_<NAME OF STEP>(<ARGUMENT OF STEP>) |>           
       step_<NAME OF STEP>(all_predictors())
print(Recipe)

Step 3 - Create a Model Design:

Next, you will create the model design and store it into the R object ModelDesign. Since you plan to tune the argument neighbors= (stands for the \(k\) in Nearest Neighbors), you have to add it as an argument into the nearest_neighbor() command by substituting <ARGUMENT(S) OF COMMAND> with the argument and its value. Remember that you cannot set neighbors to a specific numerical value because you want to tune the hyper-parameter neighbors later on. Therefore you have to assign the placeholder tune() to the argument neighbors.

ModelDesign=nearest_neighbor(<ARGUMENT(S) OF COMMAND>) |>
            set_engine("kknn") |>
            set_mode("classification")
print(ModelDesign)

Step 4 - Add the Recipe and the Model Design to a Workflow:

The code block below adds the R object Recipe and the model design object ModelDesign to a workflow model named TuneWFModel. The R code has been executed already.

TuneWFModel=workflow() |> 
            add_recipe(Recipe) |>
            add_model(ModelDesign) 
print(TuneWFModel)
## ══ Workflow ═════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ─────────────────────
## 2 Recipe Steps
## 
## • step_rm()
## • step_normalize()
## 
## ── Model ────────────────────────────
## K-Nearest Neighbor Model Specification (classification)
## 
## Main Arguments:
##   neighbors = tune()
## 
## Computational engine: kknn

You can see in the printout above that the workflow is not finalized because the number of neighbors has not been set in the model design. The hyper-parameter neighbors is set to tune() instead and will later, in Step 7, be replaced in a trial and error process with different values for neighbors.

Step 5 - Create a Hyper-Parameter Grid:

Later, when tuning is executed in Step 7, values ranging from 1 to 15 for the hyper-parameter neighbors will be tried out.

You need to provide these values in a data frame column that is named the same as the hyper-parameter. Below, replace the <LIST OF VALUES> to define a column neighbors in the data frame ParGrid. The column should contain values from 1 – 15:

ParGrid=data.frame(neighbors=c(<LIST OF VALUES>))
print(ParGrid)

Step 6 - Creating Resamples for Cross-Validation:

The values you have created above for \(k\) (hyper-parameter neighbors) will be evaluated later using five folds (resamples). Each fold contains the complete training data, but different sections are used for training and assessment in each fold.

Please create five folds below by substituting <NUMBER OF FOLDS> accordingly.

The folds will be saved in the R object FoldsForTuning:

set.seed(123)
FoldsForTuning=vfold_cv(DataTrain, v=<NUMBER OF FOLDS>, 
                        strata=WineColor)
print(FoldsForTuning)

Step 7 - Tune the Workflow and Train All Models:

Now it is time to run the tuning procedure using the tune_grid() command. Be patient because it will take some time to fully execute. Since we have to try out 15 parameter values and use five folds for each of them, the tune_grid() command has to fit 75 models \((15\cdot 5=75)\).

Please substitute <LIST OF METRICS> with a list of metrics to be calculated. Use the metrics accuracy, specificity, and sensitivity.

After the tuning is finished, all results are stored in the R object TuneResults, and they can be evaluated with different metrics commands.

For example, the command autoplot() provides a diagrammatic overview of the results for all three metrics.

TuneResults=tune_grid(TuneWFModel, resamples=FoldsForTuning,
                      grid=ParGrid, 
                      metrics=metric_set(<LIST OF METRICS>))
autoplot(TuneResults)

The three graphs that you will create in the exercise above are also displayed in Figure 6.7. They show accuracy, sensitivity, and specificity for all tried-out values of the hyper-parameter neighbors. The results for the five folds are averaged.

This figure shows three graphs representing accuracy, sensitivity, and specificity related to the tuning parameter k.

FIGURE 6.7: Tuning Results for Various k Values

You can see that neighbors values between 1 and 4 produce the best results for accuracy (the share of all wines, red and white, that are predicted correctly) and for sensitivity (the share of red wines, the positive class, that are predicted correctly). If you look at specificity (the share of white wines, the negative class, that are predicted correctly), you can see that \(k=5\) produces the best result. However, \(k=5\) also leads to a sharp decrease in sensitivity and a decrease in accuracy.

It seems reasonable to use the best result for accuracy, which means choosing a \(k\) between 1 and 4.

Step 8 - Extract the Best Hyper-Parameter(s):

All assessment results for the specified metrics are stored in the tuning object TuneResults. Choose the metric accuracy by substituting <METRIC> accordingly. Afterward, when you execute the code, the select_best() command extracts the best hyper-parameter (value for neighbors) for the metric you specified:

BestHyperPar=select_best(TuneResults, metric="<METRIC>") 
print(BestHyperPar)

As you will see in the printout after executing the R code, the best value for the hyper-parameter neighbors based on accuracy is 1. It is saved in the data frame BestHyperPar.

You can try other metrics in the code block above and see if the result changes. Why would the metric rmse cause an error?

Step 9 - Finalize and Train the Best Workflow Model:

The code block below has been executed already. The finalize_workflow() command used the value from BestHyperPar to substitute the tune() placeholder in the R object TuneWFModel (created in Steps 2 – 4). The hyper-parameter is now set to neighbors=1, which completes the workflow. The command fit(DataTrain) calibrates the workflow to the training data, and the result is saved into WFModelBest.

WFModelBest=TuneWFModel |> 
            finalize_workflow(BestHyperPar) |>  
            fit(DataTrain)

print(WFModelBest)
## ══ Workflow [trained] ═══════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ─────────────────────
## 2 Recipe Steps
## 
## • step_rm()
## • step_normalize()
## 
## ── Model ────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(1L,    
## 
## Type of response variable: nominal
## Minimal misclassification: 0.009383
## Best kernel: optimal
## Best k: 1

At the end of the printout above, you can see that \(k\) was set to 1 rather than tune(). You can also see that WFModelBest is fitted to the training data because the “Minimal misclassification” is reported (based on the training data).

Step 10 - Assess Prediction Quality Based on the Testing Data:

Since WFModelBest is a fitted model, you can use it for predictions. In this last step, you use the augment() command to predict WineColor. The augment() command will then add the predicted classes as column .pred_class to the testing data.

The conf_mat() command compares the predictions in column .pred_class to the true values to create a confusion matrix. But first, you have to substitute <OUTCOME VARIABLE> with the variable name (column name) of the outcome variable.

DataTestWithPredBestModel=augment(WFModelBest, DataTest)
conf_mat(DataTestWithPredBestModel, truth=<OUTCOME VARIABLE>, 
         estimate=.pred_class)

You will see in the confusion matrix that, based on the testing data, only six of the 480 red wines were misclassified. Likewise, only seven of the 480 white wines were misclassified. Given that the classification was only based on the chemical properties of the wines, the results are excellent.

Most likely, a true wine expert might have reached a similarly impressive result, but it would have taken them a long time to classify 960 wines.
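
The counts reported above can be translated into the three metrics used during tuning. The base-R sketch below only redoes this arithmetic; the object names are hypothetical, and red wine is treated as the positive class, as in the discussion of Figure 6.7.

```r
# Counts from the testing data as reported in the text
RedTotal=480;   RedWrong=6
WhiteTotal=480; WhiteWrong=7

# Red wine is the positive class, white wine the negative class
Sensitivity=(RedTotal-RedWrong)/RedTotal
Specificity=(WhiteTotal-WhiteWrong)/WhiteTotal
Accuracy=(RedTotal-RedWrong+WhiteTotal-WhiteWrong)/
         (RedTotal+WhiteTotal)

round(c(accuracy=Accuracy, sensitivity=Sensitivity,
        specificity=Specificity), 4)
```

With 13 misclassified wines out of 960, accuracy, sensitivity, and specificity are all above 98%.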

6.8 When and When not to Use Polynomial Regression

  • Polynomial Regression is a straightforward but not a sophisticated machine learning procedure. Therefore, it should only be used for basic non-linear relationships where possible interactions between predictor variables are known.

  • Regular OLS models and basic Polynomial Models have the advantage that the coefficients are directly interpretable. This is not true anymore even for slightly more complex Polynomial Models.

    Since the advantage of direct coefficient interpretability is lost for more complex Polynomial Models, it is recommended to use more powerful machine learning models such as Neural Networks (see Chapter 9) or tree-based models like Random Forest (see Chapter 10) when analyzing complex regression problems.

6.9 When and When not to Use Tuning

  • Anytime a machine learning model has hyper-parameters that you believe affect the predictive quality, you should use tuning.

  • Even if you have a small dataset, you can use tuning. Although the tuning procedures described here are not well suited for small datasets, you can use procedures like Bootstrapping or Leave-One-Out (see Kuhn and Silge (2022) for more details) for smaller datasets.

  • Deciding which hyper-parameters to tune and how many values to try can be challenging. This is especially true when you tune more than one hyper-parameter. In that case, you have to try different combinations of the values for each hyper-parameter, and consequently, the number of models to tune can get very big very fast. The number of models the tuning has to fit equals the number of folds times the number of hyper-parameter value combinations. For example, if you have ten folds and three hyper-parameters with five values each, and you want to try out all combinations of these hyper-parameter values, you have to fit 1,250 model/data combinations \((10\cdot5\cdot5\cdot5=1,250)\).
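
The arithmetic of the last example can be reproduced with a short base-R sketch. The names HyperPar1 to HyperPar3 and the values 1 to 5 are placeholders chosen only for this illustration.

```r
# Three hypothetical hyper-parameters with five values each
GridAll=expand.grid(HyperPar1=1:5, HyperPar2=1:5, HyperPar3=1:5)

NumCombos=nrow(GridAll)      # 5*5*5 = 125 combinations
NumFolds=10
NumFits=NumCombos*NumFolds   # 125*10 = 1250 model/data combinations
print(NumFits)
```

Adding a fourth hyper-parameter with five values would multiply the number of fits by five again, which shows how quickly a full grid can grow.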

6.10 Digital Resources

Below you will find a few digital resources related to this chapter such as:

  • Videos
  • Short articles
  • Tutorials
  • R scripts

These resources are recommended if you would like to review the chapter from a different angle or to go beyond what was covered in the chapter.

Here we show only a few of the digital resources. At the end of the list you will find a link to additional digital resources for this chapter that are maintained on the Internet.

You can find a complete list of digital resources for all book chapters on the companion website: https://ai.lange-analytics.com/digitalresources.html

Polynomial Regression Video

Mike X. Cohen provides a YouTube video that explains the basic idea of Polynomial Regression.

Link: https://ai.lange-analytics.com/dr?a=369

The Danger of Overfitting

This video by Cassie Kozyrkov, former Chief Decision Scientist at Google, explains why splitting data into training and testing data is important. She also explains why overfitting is a problem.

Link: https://ai.lange-analytics.com/dr?a=348

Supported Recipe Steps for Preprocessing

Here is a list of all recipe step_() commands that can be piped with |> to a recipe. The linked website will tell you which steps are available for which preprocessing purpose.

Link: https://ai.lange-analytics.com/dr?a=325

Supported Machine Learning Models from tidymodels

Here is a list of all supported tidymodels machine learning models. The linked website will tell you for each model:

  • the model name
  • the package name(s) for the set_engine() command
  • the hyper-parameters that you can tune

Link: https://ai.lange-analytics.com/dr?a=324

A 10-Step Template to Create, Tune, and Assess a Machine Learning Model with tidymodels

The link below will open a blog article by Carsten Lange. The article provides a tidymodels 10-step template for creating, tuning, and assessing machine learning models. The template is explained in detail and a link for downloading the related R script is provided.

Link: https://ai.lange-analytics.com/dr?a=323

More Digital Resources

Only a subset of digital resources is listed in this section. The link below points to additional, continuously updated resources for this chapter.

Link: https://ai.lange-analytics.com/dr/traintest.html

References

Aden-Buie, Garrick, Barret Schloerke, and JJ Allaire. 2022. Learnr: Interactive Tutorials for R. https://CRAN.R-project.org/package=learnr.
Chan, Chung-Hong, Geoffrey C. H. Chan, Thomas J. Leeper, and Jason Becker. 2021. Rio: A Swiss-Army Knife for Data File I/O.
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Delua, Julianna. 2021. “Supervised Vs. Unsupervised Learning: What’s the Difference?” IBM Blog. https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning.
Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Kaggle. 2015. “House Sales in King County, USA.” Online. https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.
Kuhn, Max, and Julia Silge. 2022. Tidy Modeling with R. A Framework for Modeling in the Tidyverse. O’Reilly, Sebastopol, CA. https://www.tmwr.org/.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Narula, Subhash C. 1979. “Orthogonal Polynomial Regression.” International Statistical Review/Revue Internationale de Statistique 47 (1): 31–36. http://www.jstor.org/stable/1403204.
Schliep, Klaus, and Klaus Hechenbichler. 2016. Kknn: Weighted k-Nearest Neighbors. https://CRAN.R-project.org/package=kknn.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

  1. This is at least true for supervised models. Models that can improve during the production stage, such as reinforcement learning models, exceed the scope of this book. In an IBM blog article, Delua (2021) provides a brief comparison between supervised and unsupervised machine learning models.↩︎

  2. A multivariate real estate model will be covered in Chapter 7.↩︎

  3. Later in this chapter, Polynomial Models with various degrees will be used.↩︎

  4. For more details about Orthogonal Polynomial Regression see Narula (1979).↩︎

  5. In what follows, we show how to create a validation dataset and how to perform Cross-Validation. For other procedures such as Bootstrapping or Leave-One-Out we refer to Kuhn and Silge (2022).↩︎

  6. If you prefer to use the validation dataset that we developed at the beginning of this section, you can change the resamples= argument to resamples=DataValidate.↩︎

  7. At the time of writing this book, tidymodels supported more than 30 machine learning models (see https://parsnip.tidymodels.org/reference).↩︎