Chapter 6 Polynomial Regression — Overfitting/Tuning Explained
This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.
The importance of creating training and testing data, and how this can be done in R, was emphasized in the previous chapters. Splitting the data into training and testing data is necessary because you cannot use the same data to train (calibrate) a model’s parameters and assess a model’s predictive quality. If you did that, your model would be overfitted!
Overfitting is a situation where a fitted model approximates the training data well, but fails on new observations, which it has never seen before.
In this chapter, we first explain the problem of overfitting in more detail (see Section 6.3). Then in Section 6.4, we use a univariate Polynomial Model to demonstrate overfitting with an example. This model will predict house prices based on various powers of the predictor variable square footage (e.g., \(Sqft\), \(Sqft^2\), \(Sqft^3\), \(\dots\)).
Polynomial Model
A model that expresses the relationship between the predicted outcome and the predictor variable(s) as a polynomial function is called a Polynomial Model. A Polynomial Model involving one predictor variable is called a univariate Polynomial Model. In contrast, a Polynomial Model involving two or more predictor variables is called a multivariate Polynomial Model.
Example - Univariate Polynomial Model with predictor variable \(x\):
\[ \widehat{y}= \beta_1 x +\beta_2 x^2 +\beta_3 x^3 +\beta_4 \]
Example - Multivariate Polynomial Model with two predictor variables ( \(x_1\) and \(x_2\)):
\[ \widehat{y}= \beta_1 x_1 +\beta_2 x_1^2+ \beta_3 x_1 x_2+ \beta_4 x_2 + \beta_5 x_2^2+\beta_6 \]
The term \(x_1 x_2\) is called an interaction term. This is because the term \(x_1 x_2\) only has a high value (high influence on the outcome) if \(x_1\) and \(x_2\) have high values. For example, to predict wage the two numerical predictors skill level and years of experience could be modeled as an interaction term because it requires both skill and experience to earn a high wage.
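The behavior of an interaction term can be sketched in a few lines of base R. The skill and experience values below are made up for illustration and are not data from the book; the object and column names are likewise hypothetical.

```r
# Hypothetical skill levels and years of experience (illustrative values only)
Workers = data.frame(Skill = c(1, 1, 9, 9),
                     Experience = c(1, 20, 1, 20))

# The interaction term is the product of the two predictors
Workers$SkillXExperience = Workers$Skill * Workers$Experience

print(Workers)
```

In this sketch the product is large only in the last row, where both skill and experience are high. A high value in just one of the two predictors leaves the interaction term comparatively small.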
In what follows, we will work with a univariate Polynomial Model. We will use a very small training dataset to calibrate the model parameters. The small training dataset in connection with the Polynomial Model will lead to an overfitting scenario.
In Section 6.4, the overfitting scenario will allow us to visualize why overfitting can occur and why it is problematic for predicting new observations.
In Section 6.5, we will introduce hyper-parameter tuning, an iterative procedure that helps to avoid overfitting. Hyper-parameter tuning will help us to find the best model complexity (degree of the power for \(Sqft\)) for the Polynomial Model.
In Section 6.6 we will introduce a tuning
template based on the R tidymodels package. This template will allow
you to use the concept of hyper-parameter tuning with any machine
learning model as long as it is supported by tidymodels.
In Section 6.7 you can apply the template from Section 6.6 to work on an interactive project. You will try to find the best value for the hyper-parameter \(k\) for the k-Nearest Neighbors model that was used in Chapter 4 to classify wines into red and white wines.
6.1 Learning Outcomes
This section outlines what you can expect to learn in this chapter. In addition, the corresponding section number is included for each learning outcome to help you to navigate the content, especially when you return to the chapter for review.
In this chapter, you will learn:
- How to identify under which circumstances overfitting likely occurs (see Section 6.3).
- How to apply a Polynomial Model to predict house prices (see Section 6.4).
- How to explain overfitting in detail (see Section 6.4).
- How overfitting can compromise the prediction quality for new data (see Section 6.4).
- How to use hyper-parameter tuning to avoid overfitting (see Section 6.5).
- How to work with the 10-Step Tuning Template to tune hyper-parameters for various types of machine learning models (see Section 6.6).
- How to work with a real-world dataset and apply the tuning template from Section 6.6 to find the best value for the hyper-parameter \(k\) in a k-Nearest Neighbors model (see Section 6.7).
6.2 R Packages Required for the Chapter
This section lists the R packages that you need when you load and execute code in the interactive sections in RStudio. Please install the following packages using Tools -> Install Packages \(\dots\) from the RStudio menu bar (you can find more information about installing and loading packages in Section 3.4):
- The rio package (Chan et al. (2021)) to enable the loading of various data formats with one import() command. Files can be loaded from the user’s hard drive or the Internet.
- The janitor package (Firke (2023)) to rename variable names to UpperCamel and to substitute spaces and special characters in variable names.
- The tidymodels package (Kuhn and Wickham (2020)) to streamline data engineering and machine learning tasks.
- The kableExtra package (Zhu (2021)) to support the rendering of tables.
- The learnr package (Aden-Buie, Schloerke, and Allaire (2022)), which is needed together with the shiny package (Chang et al. (2022)) for the interactive exercises in this book.
- The shiny package (Chang et al. (2022)), which is needed together with the learnr package (Aden-Buie, Schloerke, and Allaire (2022)) for the interactive exercises in this book.
- The kknn package (Schliep and Hechenbichler (2016)) to run k-Nearest Neighbors models.
6.3 The Problem of Overfitting
Machine learning aims to develop models that can be used to predict outcomes in the future. However, data scientists can only develop machine learning models based on data from the past (training data).36 They use the training data to calibrate a machine learning model’s parameters.
This approach is not without problems. When too successfully calibrating a machine learning model to the training data (i.e., the error based on the training data is very small), the model is extremely specialized in approximating the training data, but it fails with new observations that are not part of the training dataset. This is the core problem of overfitting.
Avoiding overfitting during model development by adjusting the model design (i.e., choosing hyper-parameter values) is not easy. We cannot use the testing data to assess different designs, since the testing data can only be used to evaluate the finalized model.
Without being able to use the testing data, a second-best approach to minimize overfitting is Cross-Validation. This procedure utilizes the training data and randomly chooses part of the training data as a holdout validation dataset to validate different model designs. This process is repeated on a rolling basis until each observation has been assigned once to the validation dataset (more about Cross-Validation in Section 6.5.3).
Circumstances that Can Lead to Overfitting
Identifying conditions that make overfitting more likely helps with developing strategies to avoid it. In general, overfitting is more likely to occur:
- When the training dataset does not have a sufficient number of observations.
- When the model considers many variables and consequently contains many parameters to calibrate.
- When the underlying machine learning model is highly non-linear.
6.4 Demonstrating Overfitting with a Polynomial Model
To demonstrate how overfitting occurs and which problems result from overfitting, we will use a Polynomial Model to predict housing prices.
As in the interactive Section 5.5, we utilize the King County House Sale dataset (Kaggle (2015)) and split the data into training and testing data:
library(tidymodels); library(rio); library(janitor)
DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving)

set.seed(777)
Split001=DataHousing |>
  initial_split(prop=0.001, strata=Price, breaks=5)
DataTrain=training(Split001)
DataTest=testing(Split001)

Note that the argument prop=0.001 in the initial_split() command
assigns only 20 observations to the training data. The
remaining 21,593
observations will be used as testing data. The reason to consider only
20 observations for training, although enough
observations are available to create a bigger training dataset, is that
we will purposely create circumstances that can lead to an overfitting
scenario. In real-world analysis, where we work with bigger training
datasets, overfitting is often more subtle and difficult to identify.
We use the prediction equation below to estimate the price of a house based on living square footage \((Sqft)\). The model is polynomial because \(Sqft\) is considered with various powers:
\[\begin{equation} \widehat{Price}=\beta_1 Sqft+\beta_2 Sqft^2+\beta_3 Sqft^3+\beta_4 Sqft^4+\beta_5 Sqft^5+\beta_6 \tag{6.1} \end{equation}\]
Since the price of a house \((\widehat{Price})\) is estimated based only on its square footage, the model is classified as univariate. A univariate model was chosen because it will allow us to present the results in a 2D diagram with the price on the vertical and square footage on the horizontal axis.37
As you see in Equation (6.1), the predictor variable \(Sqft\) is used with powers ranging from \(1\) to \(5\), which makes the model non-linear.38
If you look at the circumstances that likely lead to overfitting in the info box at the end of the previous section, you can see that all conditions are fulfilled here:
- The training dataset does not have a sufficient number of observations (\(20\) observations in our case).
- The model contains many variables and thus many parameters to calibrate (five variables and six parameters to calibrate is usually not considered many, but compared to only 20 observations the number of parameters can be considered as large).
- The model is highly non-linear (a polynomial of degree 5 is highly non-linear).
To create the Polynomial Model with tidymodels we use the same
model design as in the interactive Section 5.5. To
introduce non-linearity we use the recipe (the data):
ModelDesignLinRegr=linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

RecipeHouses=recipe(Price~., data=DataTrain) |>
  step_mutate(Sqft2=Sqft^2, Sqft3=Sqft^3,
              Sqft4=Sqft^4, Sqft5=Sqft^5)

The recipe utilizes step_mutate() to create four additional variables that are calculated as the second, third, fourth, and fifth powers of the variable \(Sqft\): Sqft2=\(Sqft^2\), Sqft3=\(Sqft^3\), Sqft4=\(Sqft^4\), and Sqft5=\(Sqft^5\).
Because \(Sqft\) is not only used in its original form, but also with various powers (squared, cubic, quartic, and quintic), the model is non-linear (in the data).
When looking at Equation (6.1) you can see why the model is linear in its parameters: If we treat the variables \(Sqft, Sqft^2,\dots, Sqft^5\) the same as any other variable in a multivariate linear OLS model, then each variable is multiplied by a parameter \(\beta_1\) – \(\beta_5\) and added to the equation. Thus Equation (6.1) is a linear function as long as we interpret the different powers of \(Sqft\) as separate variables.
A Polynomial Model is linear in parameters but non-linear in data.
Consequently, the Polynomial Model from Equation (6.1) can still be optimized the same way as a regular OLS model because it is still linear in its parameters when we treat each power of \(Sqft\) as a separate variable.
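The equivalence between "powers as separate variables" and a polynomial fit can be checked with a minimal base R sketch. It uses synthetic data and a degree-3 polynomial rather than the housing data and the book's tidymodels workflow; all object names are made up for illustration.

```r
set.seed(1)
# Synthetic data for illustration only (not the housing data)
x = runif(50, min = 0, max = 3)
y = 2 + x - 0.5 * x^2 + rnorm(50, sd = 0.2)

# Treat each power of x as its own column, as in a multivariate linear model
DataPowers = data.frame(y = y, x1 = x, x2 = x^2, x3 = x^3)
FitColumns = lm(y ~ x1 + x2 + x3, data = DataPowers)

# Same model written as a raw polynomial of degree 3
FitPoly = lm(y ~ poly(x, degree = 3, raw = TRUE))

# Both approaches estimate the same four parameters
all.equal(unname(coef(FitColumns)), unname(coef(FitPoly)))
```

Both calls hand the same columns to the OLS routine, so the estimated intercept and slopes agree. This is the sense in which the model stays linear in its parameters.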
We optimize by adding the recipe and the linear model design to the
workflow WFModelHouses, which is then fitted to the 20 observations
in the training dataset with fit(DataTrain):
WFModelHouses=workflow() |>
  add_recipe(RecipeHouses) |>
  add_model(ModelDesignLinRegr) |>
  fit(DataTrain)

Since the workflow WFModelHouses is fitted to the training data, we
can use it to predict and measure the model’s predictive performance. We
start with measuring the performance regarding the training data.
As before, we use the augment() command to append the predictions to
the training dataset and then we use the metrics() command to
calculate predictive performance metrics (see Sections
4.7.6 and 4.7.7 for details):
DataTrainWithPred=augment(WFModelHouses, DataTrain)
metrics(DataTrainWithPred, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   136432.
## 2 rsq     standard        0.715
## 3 mae     standard   104047.
The metrics() command by default calculates the root mean squared
error (rmse), \(r^2\) (rsq), and the mean absolute error (mae),
when provided with the column names for the estimate (.pred) and the
truth (Price) in the data frame DataTrainWithPred.
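To make the three metrics concrete, here is a minimal base R sketch that computes them by hand. The four price/prediction pairs are invented for illustration. Computing rsq as the squared correlation between truth and prediction is an assumption about how the yardstick metric is defined, so treat that line as a sketch rather than a specification.

```r
# Hypothetical true prices and predictions (illustrative values only)
Truth = c(300000, 450000, 520000, 610000)
Pred  = c(320000, 430000, 500000, 650000)

Errors = Truth - Pred          # -20000, 20000, 20000, -40000

Rmse = sqrt(mean(Errors^2))    # root mean squared error
Mae  = mean(abs(Errors))       # mean absolute error
Rsq  = cor(Truth, Pred)^2      # squared correlation of truth and prediction

c(rmse = Rmse, mae = Mae, rsq = Rsq)
```

Because rmse squares the errors before averaging, the single large miss of 40,000 weighs more heavily in rmse than in mae.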
You can see that based on the training data, the Polynomial Model
performs well. For example, based on the mae, the model
under/overestimates on average by $104,000 (for comparison, a linear
model with \(Sqft\) as the only predictor variable would create a mean
absolute error of $139,000).
The good results for the Polynomial Model based on the training data are also confirmed in Figure 6.1 where we plotted the 20 training data observations (red) together with a linear prediction function (blue) and the prediction function for the Polynomial Model (magenta).
FIGURE 6.1: Approximating Training Data with Polynomial (Degree 5)
You can see that the magenta line (the Polynomial Model) approximates the training data better than the blue line (the linear model). This is because the non-linearity of the Polynomial Model gives the magenta line more flexibility. The magenta line bends downwards to better predict lower-priced houses between 1,300 sqft and 2,200 sqft. Then it bends upwards to approximate higher priced houses with square footage between 2,500 sqft and 4,000 sqft. Finally, it bends down again to almost perfectly approximate the low-priced outlier house point with about 4,500 sqft.
In contrast, the blue line representing the linear benchmark cannot bend and therefore is located in the center of the training data.
The question remains whether the Polynomial Model (the magenta line in Figure 6.1) can also predict the testing data well.
To generate predictive metrics for the testing data, we again use the
augment() and the metrics() commands, but this time we provide the
data frame DataTest to the augment() command:
DataTestWithPred=augment(WFModelHouses, DataTest)
metrics(DataTestWithPred, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   99940240.
## 2 rsq     standard          0.0215
## 3 mae     standard    1719470.
You can see that the Polynomial Model that performed so well on the
training data performs poorly on the testing data. The mean absolute
error (mae) shows that the model under/overestimates on average by
$1,719,000 (!!!) based on the testing data (for comparison, a linear
model with \(Sqft\) as the only predictor variable would create a mean
absolute error of $168,000 based on the testing data).
Why overfitting occurred and why the testing data were predicted so poorly by the Polynomial Model can be seen in Figure 6.2. The figure shows the training data (red dots), the testing data (small black dots) together with the prediction functions for the Polynomial Model (magenta line), and the linear benchmark (blue line).
FIGURE 6.2: Model Performance on Testing and Training Data
The flexibility of the magenta line (the non-linearity of the Polynomial Model) allows it to approximate the training data very well, but the strong focus on the training data fails to represent the general trend of the remaining data (testing data). This is exactly the problem of overfitting.
In contrast, the blue linear line cannot bend (the linear one-predictor benchmark model can only produce a straight line). This prevents the model from overfitting (over-approximating the training data).
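The same train-versus-test comparison can be reproduced on synthetic data with base R. This is only a sketch under stated assumptions: the data come from a made-up linear trend plus noise, the helper names MakeData and Rmse are hypothetical, and the models are fitted with lm() instead of the tidymodels workflow.

```r
set.seed(42)
# Synthetic data for illustration only: a linear trend plus noise
MakeData = function(n) {
  x = runif(n, min = 0, max = 1)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.5))
}
DataTrainSim = MakeData(20)     # small training set, as in this section
DataTestSim  = MakeData(2000)   # large testing set

# Root mean squared error of a fitted model on a given dataset
Rmse = function(Fit, Data) sqrt(mean((Data$y - predict(Fit, Data))^2))

FitLinear = lm(y ~ x, data = DataTrainSim)
FitDeg5   = lm(y ~ poly(x, degree = 5), data = DataTrainSim)

data.frame(Model = c("Linear", "Degree 5"),
           RmseTrain = c(Rmse(FitLinear, DataTrainSim), Rmse(FitDeg5, DataTrainSim)),
           RmseTest  = c(Rmse(FitLinear, DataTestSim),  Rmse(FitDeg5, DataTestSim)))
```

The degree-5 fit can never have a larger training error than the linear fit, because the linear model is a special case of it. Whether and by how much its testing error is worse depends on the random draw, so the numbers should be read as one illustration rather than a general result.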
The problems with overfitting can further be demonstrated when extending the Degree-5 Polynomial Model to a Degree-10 Polynomial Model. This means that in the prediction equation, exponents all the way up to 10 are considered for \(Sqft\):
\[\widehat{Price}=\beta_1 Sqft+\beta_2 Sqft^2+\beta_3 Sqft^3 +\beta_4 Sqft^4 + \cdots +\beta_{9} Sqft^{9}+\beta_{10} Sqft^{10}+\beta_{11}\]
Thanks to the tidymodels package, it is fairly easy to repeat the
analysis for a Degree-10 Polynomial Model by only modifying the
recipe. Instead of using step_mutate() to generate new predictor
variables with various powers of \(Sqft\), we now use step_poly() to
generate 10 new variables for the 10 different powers of \(Sqft\):
RecipeHousesPoly10=recipe(Price~., data=DataTrain) |>
  step_poly(Sqft, degree=10,
            options=list(raw=TRUE))

By default (options=list(raw=FALSE)), the command step_poly()
transforms the calculated polynomials to orthogonal polynomials, which
allows for better interpretation of the regression parameters
(\(\beta s\)). Since the Polynomial Regression model we use here is too
complex to allow any interpretation of its parameters and since
orthogonal polynomials exceed the scope of this book, we will work with
the original polynomials in what follows (options=list(raw=TRUE)). The
predictions are the same regardless of whether you use original or
orthogonal polynomials.39
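That original and orthogonal polynomials lead to the same predictions can be checked with base R's poly() and lm(). The sketch below uses synthetic data and a degree-3 polynomial; it is not the book's step_poly() recipe, and the object names are made up for illustration.

```r
set.seed(7)
# Synthetic data for illustration only
x = runif(30, min = 1, max = 4)
y = 5 + 3 * x - x^2 + rnorm(30, sd = 0.3)

FitRaw  = lm(y ~ poly(x, degree = 3, raw = TRUE))   # original powers
FitOrth = lm(y ~ poly(x, degree = 3))               # orthogonal polynomials

# The coefficients differ, but the fitted values agree
all.equal(unname(fitted(FitRaw)), unname(fitted(FitOrth)))
```

Both parameterizations describe the same set of degree-3 curves, so OLS picks the same curve in either case. Only the coefficients that describe the curve change.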
If we graph the resulting prediction function of the Degree-10 Polynomial Model (magenta line in Figure 6.3), you can see the complete disaster of extreme overfitting.
FIGURE 6.3: Model Performance on Testing and Training Data
The Degree-10 Polynomial Model represented by the magenta line approximates the 20 training observations (red dots) almost perfectly but totally fails to predict the testing data (small black dots).
For example, the Degree-10 Polynomial Model predicts extremely high housing prices for houses with small square footage, predicts negative housing prices for houses with square footage in the ranges of 1,000 – 1,200 sqft and 2,700 – 3,700 sqft, and predicts extremely high prices for houses with 3,900 – 4,400 sqft.
From Figure 6.3, it is easy to see why overfitting occurred. However, in almost all real-world cases we will not be able to generate a graph like in Figure 6.3. Therefore, comparing performance between training and testing data is the only way to identify overfitting.
Overfitting Summary
Overfitting occurs when a highly non-linear (flexible) model adjusts almost perfectly to a relatively small number of training observations, but the adjustment is so specific that the model fails when predicting the testing data.
Overfitting is sometimes compared to human learning behavior, when somebody learns facts by heart rather than understanding the underlying theory.
If — after the completion of model development — a comparison of training and testing data reveals overfitting, the model development needs to start from scratch.
The new model development includes generating new sets of training and
testing data (change of the value in set.seed()). Otherwise, we risk
extending the overfitting from the training to the testing data.
6.5 Tuning the Complexity of a Polynomial Model
In the previous section, overfitting occurred in the Degree-5 Polynomial Model and to an extreme degree in the Degree-10 Polynomial Model. The linear model (blue line in Figures 6.2 and 6.3) seemed to be superior to the Degree-5 and Degree-10 Polynomial Models.
However, this does not mean a linear model is always the best choice. For example, when the underlying pattern that generated the data is non-linear, a linear regression model is poorly suited to fit that pattern.
This raises the question:
How can we choose the right degree for a Polynomial Model?
6.5.1 Hyper-Parameters vs. Model Parameters
The appropriate degree for a Polynomial Model cannot be found with the Optimizer that calibrates the \(\beta\)-parameters (model parameters) because in order to run the Optimizer, we have to choose the model first, including the polynomial degree. Therefore, parameters like the polynomial degree that need to be determined before the model calibration are called hyper-parameters rather than model parameters.
Hyper-Parameters
Hyper-parameters are parameters that change the design of a machine learning model or the data pre-processing in a recipe. Examples are the hyper-parameter \(k\) in a k-Nearest Neighbors model design and the degree of a Polynomial Model in a recipe.
Hyper-parameters often control the degree of non-linearity, and thus the flexibility of the prediction function. Therefore, we often face a trade-off between model flexibility and the risk of overfitting.
Hyper-parameters cannot be determined with an Optimizer based on training data like a model’s \(\beta\) parameters. Therefore, we must utilize a trial-and-error process to find suitable hyper-parameters. This process is called hyper-parameter tuning.
We cannot utilize the testing data for hyper-parameter tuning because this can extend overfitting into the testing data!
Finding the best hyper-parameter values is not specific to Polynomial Models; we ran into this challenge already in Chapter 4 when deciding on the parameter \(k\) (the number of neighbors to consider) for k-Nearest Neighbors. Later, in Chapter 9, when covering Neural Network models, we have to decide how many neurons to consider when building a Neural Network. The number of neurons in a Neural Network is also considered a hyper-parameter. In fact, most machine learning models require finding the best hyper-parameters at the model design stage.
6.5.2 Creating the Tuning Workflow
In this and the following sections, we will tune the hyper-parameter for the polynomial degree of the model covered in Section 6.4 using the same dataset. The only exception is the number of observations assigned to the training data. Instead of using an extremely small training dataset, we will use a more realistic training dataset size. In the code block below, we again import the King County House Sale dataset (Kaggle (2015)), but now split the 21,613 observations into 80% training and 20% testing data:
library(tidymodels); library(rio); library(janitor)
DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving)

set.seed(987)
Split80=DataHousing |>
  initial_split(prop=0.8, strata=Price, breaks=5)
DataTrain=training(Split80)
DataTest=testing(Split80)

Building the workflow for the analysis follows almost the same steps as in the previous section. We create a recipe and a model design. Afterward, we add both to a workflow:
RecipeHousesPolynomOLS=recipe(Price~., data=DataTrain) |>
  step_poly(Sqft, degree=tune(),
            options=list(raw=TRUE))

ModelDesignLinRegr=linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

TuneWFModelHouses=workflow() |>
  add_model(ModelDesignLinRegr) |>
  add_recipe(RecipeHousesPolynomOLS)

However, there are two differences compared to the workflow in Section 6.4:
- In step_poly(), where the argument degree= determines the highest power of \(Sqft\) in the prediction equation, we do not assign a number for the degree. This makes sense because the aim of hyper-parameter tuning is to find this number (the degree of the Polynomial Model). Since the argument degree= needs somehow to be determined, we use tune() as a placeholder. It is important not to over-interpret the meaning of tune(). It is only a placeholder that, later, will get replaced by the values for degree when we try and evaluate different values for degree.
- The workflow does not contain a fit() command to fit the model parameters to the training data. This also makes sense: Because the degree for the polynomial function is not determined, fitting the model is not possible. Consequently, the workflow cannot be used for predictions. It is only a blueprint for the tuning process that we will later perform. To clarify that the workflow is used for tuning only, the related R object name is prefixed with the word Tune (TuneWFModelHouses).
In order to evaluate the performance of several different degrees for the Polynomial Model, we have to decide which degrees we would like to try out. This is an arbitrary decision.
Nevertheless, the tidymodels package can still provide some guidance.
For most hyper-parameters a related command exists that returns a
recommended hyper-parameter value range. The name of the command is
often the same as the name of the hyper-parameter. For example, you
can use the command degree() to find a recommended range for the
hyper-parameter degree:
degree()
## Polynomial Degree (quantitative)
## Range: [1, 3]
The command returns a recommended range for the hyper-parameter
degree from \(1\) – \(3\). We will extend this range and evaluate
polynomial degrees from \(1\) – \(10\).
For tuning purposes, the tidymodels package expects a data frame with
the values for each hyper-parameter in the columns. The column names
must be the same as the name of the respective hyper-parameter. Since
we tune only one hyper-parameter in this case, the data frame
ParGridHouses contains only one column named degree with the values
from \(1\) – \(10\):
## degree
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
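The code that created ParGridHouses is not shown in this extract. One plausible way to build a data frame that prints as above is the base R call below. This is an assumption for illustration and not necessarily the command used in the book.

```r
# One column named after the hyper-parameter, holding the values 1 to 10
ParGridHouses = data.frame(degree = 1:10)
print(ParGridHouses)
```

The column name matters more than how the data frame is built: it has to match the name of the hyper-parameter that was marked with tune().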
Later, during the tuning process, the hyper-parameter values above
will be pushed one by one to the tuning workflow TuneWFModelHouses.
Each value will fill in for the placeholder tune() and the workflow
will be fitted. Next, its predictive quality will be evaluated. The
workflow (the polynomial degree) with the best performance
constitutes the best model.
6.5.3 Validating the Tuning Results
To find the best model, we need to validate each hyper-parameter value from the tuning results (each hyper-parameter combination in case we tune more than one hyper-parameter). This raises the question:
Which dataset should be used to validate the tuning results?
You might be tempted to use the testing dataset to validate different values for hyper-parameters. However, keep in mind that the testing dataset should never be used for any type of optimization — including hyper-parameter tuning.
If you ignore this rule, you might get a good performance on the testing data. But because now the optimization is specialized for the testing data, you likely will get poor predictive results in the production stage when you confront your model with new data, which it had never seen before. This would be an example of pushing the overfitting problem from the training to the testing data.
Using the complete training dataset to find the best hyper-parameter value is also not an option because the best performing hyper-parameter value would be the one that triggers the highest degree of overfitting.
In what follows, we will introduce two strategies to assess various values of hyper-parameters without using the testing dataset or the complete training dataset:40
Validation Dataset
One option is to randomly choose a number of observations from the training dataset, exclude them from training, and assign them to an additional holdout dataset called the validation dataset.
The validation dataset will never be used for training. Instead, the observations in the validation dataset are set aside to assess the predictive performance for different hyper-parameter values.
A validation dataset is very similar to a testing dataset since both are used to assess the performance of a specific model. The difference is that the assessment is performed at different stages of development. While the validation dataset is used during the model design stage to find the best hyper-parameter values, the testing dataset is used after the development of the machine learning model is finalized to assess overall predictive quality.
The validation_split() command in the code block below can be used to
split the training data into observations that are used for training
(analysis observations) and those that are used to assess the
predictive performance for various hyper-parameter settings
(assessment observations). The command validation_split() is similar
to the command initial_split(). The argument prop=0.85 determines
the proportion of observations retained for training, and the argument
strata=Price ensures that different housing price levels are
proportionally distributed between training and assessment. The split
observations are then stored in the data frame DataValidate:
The resulting data frame DataValidate includes the complete training
dataset, but observations are internally earmarked with analysis to
indicate that an observation will be used for training, and with
assessment to indicate that the observation will be used as validation
data to assess hyper-parameter performance.
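The earmarking idea can be sketched in base R without the rsample package. The sketch below assumes a made-up stand-in for the training data and a hypothetical Role column. Unlike the stratified split described above, it draws the rows purely at random.

```r
set.seed(123)
# Synthetic stand-in for the training data (illustrative values only)
DataTrainSim = data.frame(Price = runif(200, 100000, 900000),
                          Sqft  = runif(200, 500, 5000))

# Randomly earmark 85% of the rows for analysis and the rest for assessment
N = nrow(DataTrainSim)
IdxAnalysis = sample(N, size = round(0.85 * N))
DataTrainSim$Role = ifelse(seq_len(N) %in% IdxAnalysis, "analysis", "assessment")

table(DataTrainSim$Role)
```

All 200 rows stay in one data frame; only the label decides whether a row is used to fit a model or to assess a hyper-parameter value.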
Using a validation dataset to compare hyper-parameter performance is appropriate for large datasets. For smaller datasets, there are a couple of disadvantages:
- Excluding observations from the training process and earmarking them for hyper-parameter assessment reduces the number of observations that are available for training.
- The observations used for the assessment of hyper-parameters are randomly chosen. This bears the risk that, by accident, an unusual validation dataset might be created (the risk is higher for smaller training datasets). Evaluating hyper-parameters based on unusual assessment observations might lead to a sub-par choice of hyper-parameter values.
Cross-Validation
Instead of using one dataset where observations are earmarked for training or hyper-parameter assessment, the Cross-Validation procedure creates multiple training/assessment datasets called folds or resamples. These folds differ only by which observations are chosen for training and which ones are used for hyper-parameter assessment. Figure 6.4 shows the basic idea behind Cross-Validation for four folds.
FIGURE 6.4: The Basic Idea Behind Cross-Validation
Cross-Validation shuffles the training dataset and then copies it \(N\) times, assigning each copy to one of \(N\) folds. Each of these folds has a different set of observations excluded from the training and used for the assessment of the various hyper-parameter combinations.
Figure 6.4 shows an example for four folds. The shuffled training dataset is copied four times into Folds 1 – 4. In Fold 1, the last quarter of observations is assigned to the assessment dataset. The remaining observations are used for training. In Fold 2, the third quarter of observations is assigned to the assessment dataset, and the remaining observations are used for training. In Fold 3, the second quarter of observations is assigned to the assessment dataset, and in Fold 4, the first quarter. This ensures that every observation is used exactly once in an assessment dataset.
When a model is tuned, each of the hyper-parameter values is assessed for all four folds (requires training of the model for each fold). The overall performance for a hyper-parameter value is calculated as the mean performance of the assessment observations in folds 1, 2, 3, and 4. The same process is then repeated for the other hyper-parameter values.
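The procedure described above can be written out as a small base R loop. This is a sketch under stated assumptions: the data are synthetic (a sine curve plus noise), folds are assigned without stratification, and the helper name CvRmse is hypothetical. It is not the tidymodels code used later in the chapter.

```r
set.seed(2024)
# Synthetic data for illustration only
N = 200
x = runif(N, 0, 1)
DataSim = data.frame(x = x, y = sin(2 * pi * x) + rnorm(N, sd = 0.3))

# Shuffle the rows, then assign each row to exactly one of four folds
DataSim = DataSim[sample(N), ]
DataSim$Fold = rep(1:4, length.out = N)

# Mean assessment RMSE across the four folds for one polynomial degree
CvRmse = function(Degree) {
  RmsePerFold = sapply(1:4, function(k) {
    Analysis   = DataSim[DataSim$Fold != k, ]   # rows used for training
    Assessment = DataSim[DataSim$Fold == k, ]   # rows held out in fold k
    Fit = lm(y ~ poly(x, degree = Degree), data = Analysis)
    sqrt(mean((Assessment$y - predict(Fit, Assessment))^2))
  })
  mean(RmsePerFold)
}

# Repeat the same process for each candidate degree
Results = data.frame(degree = 1:6)
Results$MeanRmse = sapply(Results$degree, CvRmse)
Results
```

Each degree is fitted four times, once per fold, and judged only on rows that were held out of that fit. The degree with the lowest mean RMSE would be the candidate to carry forward.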
It is common to choose ten folds if the training dataset is sufficiently
big. For smaller datasets, a lower number of folds can be selected. To
compensate for a low number of folds, the process of shuffling the
training data, creating folds, and training/assessing the models can be
repeated several times. This requires setting the repeats= argument for
the related vfold_cv() command to a value \(>1\) (the default is
repeats=1).
The advantage of Cross-Validation is that different sets of observations are used for assessment (the mean prediction error across the folds is used to assess overall performance), and every observation of the training data is used for assessment at some stage. Therefore, the risk of an unusual assessment dataset is mitigated.
The disadvantage of Cross-Validation is that each hyper-parameter setup needs to be trained and assessed separately for each of the folds. Computation time increases exponentially with the number of hyper-parameters tuned (when all combinations of their values are tried out) and proportionally with the number of folds used.
Using the R code block below, you can create the Cross-Validation folds for our Polynomial Model:
For simplicity, we generate only four folds, although our dataset
would be big enough to choose the common 10-fold setup. The command
vfold_cv() creates the four folds (see the argument v=4). The
strata argument ensures that the different house prices are
proportionally represented in the various assessment folds.
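A minimal sketch of how such folds might be created is shown here. It assumes the `rsample` package (part of `tidymodels`) is installed and uses a hypothetical synthetic data frame `DataTrainDemo` in place of the housing training data; the seed and the simulated prices are assumptions for illustration only.

```r
library(rsample)

set.seed(777)
# Hypothetical stand-in for the housing training data
DataTrainDemo=data.frame(Sqft=runif(200, 500, 5000))
DataTrainDemo$Price=100000+200*DataTrainDemo$Sqft+rnorm(200, sd=50000)

# Four folds, stratified by the outcome variable
FoldsHouses=vfold_cv(DataTrainDemo, v=4, strata=Price)

# The same setup repeated twice (2 x 4 = 8 resamples)
FoldsRepeated=vfold_cv(DataTrainDemo, v=4, repeats=2, strata=Price)

# Number of assessment observations in each of the four folds
AssessSizes=sapply(FoldsHouses$splits, function(s) nrow(assessment(s)))
print(FoldsHouses)
print(AssessSizes)
```

The assessment sets of the four folds together cover all 200 observations, which mirrors the idea that every observation is used exactly once for assessment.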
6.5.4 Executing the Tuning Process
Now that we have stored the four folds for training and
hyper-parameter assessment in the R object FoldsHouses and the
hyper-parameters to be tried out in ParGridHouses, we can use the
command tune_grid() to
evaluate each of the ten parameter values (degree 1 – 10). Given four
folds and ten parameter values to evaluate, the tune_grid() command
has to train and assess a total of 40 model/data variations.
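To make the count of 40 model/data variations concrete, the following base R sketch loops over four folds and ten polynomial degrees by hand. It is a rough stand-in for the idea behind tuning with Cross-Validation, not a description of how `tune_grid()` is implemented. The synthetic data and all object names are assumptions.

```r
set.seed(123)
n=400
# Hypothetical synthetic stand-in for the housing training data
DataTrainDemo=data.frame(Sqft=runif(n, 500, 5000))
DataTrainDemo$Price=100000+150*DataTrainDemo$Sqft+
  0.03*pmax(DataTrainDemo$Sqft-3000, 0)^2+rnorm(n, sd=40000)

NumFolds=4
Degrees=1:10

# Shuffle and assign every observation to exactly one assessment fold
FoldId=sample(rep(1:NumFolds, length.out=n))

# One row per fold/degree combination
Results=expand.grid(Fold=1:NumFolds, Degree=Degrees)
Results$Rmse=NA_real_

for (i in seq_len(nrow(Results))) {
  InAssess=FoldId==Results$Fold[i]
  Deg=Results$Degree[i]
  # Train on the observations outside the assessment part of the fold
  Fit=lm(Price~poly(Sqft, Deg), data=DataTrainDemo[!InAssess, ])
  # Assess on the held-out part
  Pred=predict(Fit, newdata=DataTrainDemo[InAssess, ])
  Results$Rmse[i]=sqrt(mean((DataTrainDemo$Price[InAssess]-Pred)^2))
}

nrow(Results)   # number of model/data variations: 4 folds x 10 degrees

# Mean rmse over the four folds for each degree, and the best degree
MeanRmse=aggregate(Rmse~Degree, data=Results, FUN=mean)
MeanRmse[which.min(MeanRmse$Rmse), ]
```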
The command tune_grid() executes the tuning process (see the R code
block below). It requires the name of the tuning workflow
(TuneWFModelHouses) as the first argument. Then the data frame with
the hyper-parameter values to be tried out must be provided with the
grid= argument (in our case: grid=ParGridHouses). Finally, an
argument for the R object that holds the folds for training and
assessment (resamples=FoldsHouses) is required. The metrics argument
is optional. In the R code below, the argument
metrics=metric_set(rmse,rsq,mae) determines that performance metrics
for the root mean squared error (rmse), \(r^2\) (rsq), and the mean
absolute error (mae) are calculated for each parameter value and for
each fold:41
TuneResultsHouses=tune_grid(TuneWFModelHouses, resamples=FoldsHouses,
                            grid=ParGridHouses,
                            metrics=metric_set(rmse,rsq,mae))
Tuning a workflow can take a while, from a few seconds to a day or
more depending on the number of hyper-parameters to tune and the
number of folds used for validation. In the code block above, the
results from tune_grid() for each parameter value, each fold, and each
performance metric are saved in the R object TuneResultsHouses.
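The three metrics can be illustrated with a small base R calculation. The four prices and predictions below are hypothetical numbers chosen for easy arithmetic. The \(r^2\) is computed as the squared correlation between true and predicted values, which, as far as we are aware, is also how the rsq metric in yardstick is defined.

```r
# Hypothetical true prices and predictions
Truth=c(200000, 350000, 500000, 650000)
Pred=c(220000, 330000, 540000, 600000)

Errors=Truth-Pred

Rmse=sqrt(mean(Errors^2))   # root mean squared error
Mae=mean(abs(Errors))       # mean absolute error
Rsq=cor(Truth, Pred)^2      # squared correlation

c(rmse=Rmse, mae=Mae, rsq=Rsq)
```

For these numbers the rmse is 35,000 and the mae is 32,500. The rmse is larger because squaring gives more weight to the larger errors.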
There are several ways to extract information from the R object
TuneResultsHouses. For example, you can use the command autoplot()
to create a graphical overview of the performance for the different
hyper-parameter values (see Figure 6.5).
FIGURE 6.5: Performance Metrics During Tuning
The plots in Figure 6.5 indicate that a linear
equation (degree=1) does not perform as well as some of the polynomials
with a degree>1.
The performance for degrees 2 – 8 is similar for the three metrics
(with the exception of degree=7 for the rsq metric).
Polynomials with degrees 9 and 10 have a poor predictive performance based on the assessment from the Cross-Validation folds.
To see more details, we extract the hyper-parameter value rankings
from the tuning results TuneResultsHouses based on the three performance
measures. We start with the best five hyper-parameters using the
metric root mean squared error (rmse):
## # A tibble: 5 × 7
##   degree .metric .estimator    mean     n std_err
##    <int> <chr>   <chr>        <dbl> <int>   <dbl>
## 1      6 rmse    standard   251993.     4   6179.
## 2      8 rmse    standard   252979.     4   5971.
## 3      2 rmse    standard   255965.     4   7243.
## 4      3 rmse    standard   257680.     4   7875.
## 5      4 rmse    standard   260994.     4   9911.
## # ℹ 1 more variable: .config <chr>
The best (lowest) rmse was 251,993 for a polynomial degree of 6. The
linear model (degree=1) did not make it to the top five.
The ranking of the five best-performing models based on \(r^2\) is the
same as the one for rmse:
## # A tibble: 5 × 7
##   degree .metric .estimator  mean     n std_err .config
##    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1      6 rsq     standard   0.522     4  0.0304 Prepro…
## 2      8 rsq     standard   0.521     4  0.0110 Prepro…
## 3      2 rsq     standard   0.513     4  0.0277 Prepro…
## 4      3 rsq     standard   0.508     4  0.0226 Prepro…
## 5      4 rsq     standard   0.498     4  0.0295 Prepro…
The ranking for the mean absolute error (mae) is slightly different.
However, the best-performing model is still the Degree-6 Polynomial
Model:
## # A tibble: 5 × 7
##   degree .metric .estimator    mean     n std_err
##    <int> <chr>   <chr>        <dbl> <int>   <dbl>
## 1      6 mae     standard   165798.     4   1924.
## 2      8 mae     standard   165868.     4   1764.
## 3      4 mae     standard   166325.     4   2137.
## 4      3 mae     standard   166434.     4   1863.
## 5      2 mae     standard   166546.     4   1840.
## # ℹ 1 more variable: .config <chr>
Since all three performance measures ranked a Polynomial Model of
degree 6 as the best model, it does not matter which performance measure
we choose to extract the degree for the best model. For example, to
extract the best-performing hyper-parameter to minimize rmse, we can
use:
## # A tibble: 1 × 2
##   degree .config
##    <int> <chr>
## 1      6 Preprocessor06_Model1
The printout of BestHyperPar shows that the best-performing
hyper-parameter value is saved in the data frame column degree as
the first and only entry.
We will use this data frame to create a model (the best one) with the
best hyper-parameter (degree=6).
To do this, we add the best hyper-parameter to the tune workflow
with finalize_workflow(). This command substitutes the tune()
placeholder with the optimal hyper-parameter value that is stored in
BestHyperPar. Afterward, we add the fit() command to the pipe to
train the workflow model:
BestWFModelHouses=TuneWFModelHouses |>
finalize_workflow(BestHyperPar) |>
fit(DataTrain)
print(BestWFModelHouses)
## ══ Workflow [trained] ═══════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ─────────────────────
## 1 Recipe Step
##
## • step_poly()
##
## ── Model ────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept)  Sqft_poly_1  Sqft_poly_2  Sqft_poly_3
##   -1.30e+04     6.57e+02    -4.90e-01     2.05e-04
## Sqft_poly_4  Sqft_poly_5  Sqft_poly_6
##   -3.76e-08     3.18e-12    -9.93e-17
The printout above from the fitted workflow confirms that
BestWFModelHouses is a fitted workflow because it contains the values
for the estimated model-parameters (see Coefficients). Consequently,
BestWFModelHouses can be used for predictions. We use the augment()
command to predict based on the testing dataset and the metrics()
command to calculate the related metrics:
DataTestWithPredBestModel=augment(BestWFModelHouses, DataTest)
metrics(DataTestWithPredBestModel, truth=Price, estimate=.pred)
## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   240706.
## 2 rsq     standard        0.586
## 3 mae     standard   164987.
FIGURE 6.6: Polynomial Degree-6 vs. Linear Prediction Functions
Given that we used a univariate model with \(Sqft\) as the only predictor variable, the results look quite good. Based on the testing data, \(r^2=0.5857\). The mean absolute error shows that the housing price is, on average, under- or overestimated by $164,987 (for comparison, a linear model with \(Sqft\) as the only predictor variable would produce a mean absolute error of $173,000 based on the testing data).
Figure 6.6 illustrates why the Degree-6 Polynomial Model (magenta line) performs better than a linear regression (blue line) and why it does not lead to overfitting: Although a polynomial function of degree 6 is potentially very flexible, you can see in Figure 6.6 that it differs only slightly from the linear prediction function for houses smaller than 3,000 sqft. For houses larger than 3,000 sqft, the Degree-6 prediction function estimates higher-valued houses much better than the linear function.
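The comparison between a linear and a Degree-6 prediction function can be reproduced in spirit with base R. The sketch below does not use the book's housing data. It simulates hypothetical prices that rise more steeply above 3,000 sqft, so the numbers it produces are not comparable to the results above. Note that `poly()` uses orthogonal polynomials by default, which spans the same model space as raw powers of \(Sqft\).

```r
set.seed(42)
n=600
# Hypothetical data: prices rise more steeply for houses above 3,000 sqft
DataDemo=data.frame(Sqft=runif(n, 500, 5000))
DataDemo$Price=100000+150*DataDemo$Sqft+
  0.05*pmax(DataDemo$Sqft-3000, 0)^2+rnorm(n, sd=30000)

# Simple 80/20 split into training and testing data
TrainRows=sample(n, size=0.8*n)
DataTrainDemo=DataDemo[TrainRows, ]
DataTestDemo=DataDemo[-TrainRows, ]

# Linear model and Degree-6 Polynomial Model, both fitted on training data
FitLinear=lm(Price~Sqft, data=DataTrainDemo)
FitDegree6=lm(Price~poly(Sqft, 6), data=DataTrainDemo)

# Mean absolute error on the testing data
Mae=function(Fit) {
  mean(abs(DataTestDemo$Price-predict(Fit, newdata=DataTestDemo)))
}
MaeLinear=Mae(FitLinear)
MaeDegree6=Mae(FitDegree6)
c(Linear=MaeLinear, Degree6=MaeDegree6)
```

In this simulated setting the polynomial model should show the lower testing error because a straight line cannot follow the steeper price increase for large houses.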
6.6 10-Step Template to Tune with tidymodels
In the
previous section, we used the tidymodels package to tune a polynomial
machine learning model. Since tidymodels provides a unified analysis
framework independent of the machine learning model, you can use the
same set of commands for many other machine learning models.
In this section, we provide a 10-Step Template to make it easy for you to develop a complete machine learning analysis that includes tuning hyper-parameters and also assessing the final results based on the testing data. You can use the template for all machine learning models covered in this book and for other machine learning models not covered here as well.42
Below you will find the 10-Step Template together with sample code. In the Digital Resources section for this chapter (see Section 6.10), you will find an R script that contains the R code for the 10-Step Template together with an example.
- Step 1 - Generate Training and Testing Data:
  Note, it is assumed that the data frame MyData contains the data you are analyzing.
    set.seed(987)
    Split80=MyData |>
      initial_split(prop=0.8, strata=<OUTCOME VARIABLE>, breaks=5)
    DataTrain=training(Split80)
    DataTest=testing(Split80)
  Substitute <OUTCOME VARIABLE> with the name of your outcome variable for the strata= argument. This ensures that the distribution of your outcome variable is similar in the training and testing data.
- Step 2 - Create a Recipe:
  In the code block below, substitute <OUTCOME VARIABLE> with your outcome variable and <PREDICTOR VARIABLE(S)> with a list of your predictor variables separated by “+”-signs.
  Alternatively, you can use a “.” on the right of the “~”-sign to use all predictor variables from the related data frame.
    Recipe=recipe(<OUTCOME VARIABLE>~<PREDICTOR VARIABLE(S)>, data=DataTrain) |>
      step_<NAME OF STEP>(<ARGUMENT(S) OF STEP>)
  Note that step_<NAME OF STEP>() stands for an optional pre-processing step of the predictor variables. <ARGUMENT(S) OF STEP> represents optional arguments for the related step_ command.
  If you plan to tune a hyper-parameter in a recipe, you have to assign the tune() placeholder rather than a value to the related argument. For example, degree=tune() in step_poly().
  You can find a list of available step commands together with their names at: https://recipes.tidymodels.org/reference.
- Step 3 - Create a Model Design:
  Substitute <NAME OF ML-COMMAND> with the command name for the related machine learning model and, optionally, <ARGUMENT(S) OF COMMAND> with arguments for that command.
  Then substitute <PACKAGE NAME> in the set_engine() command with the package name for the machine learning model and <MODE> in the set_mode() command with either regression or classification.
    ModelDesign=<NAME OF ML-COMMAND>(<ARGUMENT(S) OF COMMAND>) |>
      set_engine("<PACKAGE NAME>") |>
      set_mode("<MODE>")
  If you plan to tune a hyper-parameter in a model design, you have to assign the tune() placeholder rather than a value to the related argument. For example, neighbors=tune() in the nearest_neighbor() command.
  You can find the command names for various machine learning models together with the related package names for the set_engine() command at: https://parsnip.tidymodels.org/reference.
- Step 4 - Add the Recipe and the Model Design to a Workflow:
  In the code block below the recipe (named: Recipe) and the model design (named: ModelDesign) are added to the workflow TuneWFModel:
- Step 5 - Create a Hyper-Parameter Grid:
  The hyper-parameter values that need to be tried out must be listed in a data frame column named with the same name as the hyper-parameter:
  The data.frame() command is one way to create the required data frame. <HYPER-PAR1> is the name of a hyper-parameter, and the c() command can be utilized to provide a list of values. If you tune more than one hyper-parameter, add these in the same way.
  If you tune only one hyper-parameter, for example, only the number of neighbors in a k-Nearest Neighbors model, the code could look like this:
- Step 6 - Create Resamples for Cross-Validation:
  To create the resamples for Cross-Validation, you can use the following code:
  A typical Cross-Validation setup includes ten folds. If you would like to work with a smaller number of folds, especially for smaller datasets, change v=10 accordingly to reflect the number of folds.
  To ensure the outcome variable is similarly distributed in each section of the folds, substitute <OUTCOME VARIABLE> with the name of your outcome variable.
- Step 7 - Tune the Workflow and Train All Models:
  The command tune_grid() trains models for all hyper-parameter combinations stored in the data frame ParGrid using all resamples stored in the previous step in FoldsForTuning.
    TuneResults=tune_grid(TuneWFModel, resamples=FoldsForTuning,
                          grid=ParGrid,
                          metrics=metric_set(<LIST OF METRICS>),
                          control=control_grid(verbose=TRUE))
  The optional metrics argument specifies the metrics that are calculated. For example, substitute metric_set(<LIST OF METRICS>) with metric_set(rmse, rsq, mae) for a regression or with metric_set(accuracy, sensitivity, specificity) for a classification problem.
  The argument control=control_grid(verbose=TRUE) is optional. When used like here with verbose=TRUE, the tuning reports its progress to the R Console.
- Step 8 - Extract the Best Hyper-Parameter(s):
  Because all assessment results for the specified metrics are stored in the tuning object TuneResults, we can use select_best() to extract the best hyper-parameter(s) for the metric we are interested in:
  You need to specify which metric should be used to identify the best-performing model by substituting <METRIC> with the metric of your choice. Note that only metrics specified previously in Step 7 can be chosen.
- Step 9 - Finalize and Train the Best Workflow Model:
  You can use the command finalize_workflow() to substitute the tune() placeholder in TuneWFModel with the values from BestHyperPar. Afterward, the command fit(DataTrain) trains the finalized workflow model with the training data:
- Step 10 - Assess Prediction Quality Based on the Testing Data:
  This step should only be performed after the model is completed and no further changes are planned. Otherwise, the testing data would influence the model development and could no longer be used for an unbiased assessment.
    DataTestWithPredBestModel=augment(BestWFModel, DataTest)
    metrics(DataTestWithPredBestModel, truth=<OUTCOME VARIABLE>, estimate=.pred)
  The augment() command writes the predictions into a column named .pred and augments the testing data frame with that column. The resulting data frame is saved as DataTestWithPredBestModel.
  Since the metrics() command needs to compare these predictions with the true values to calculate the metrics, the name of the outcome variable also needs to be provided by substituting <OUTCOME VARIABLE> accordingly.
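The ten steps can be sketched end to end as follows. This is one plausible way to fill in the placeholders, not the script from the Digital Resources section. It assumes that `tidymodels` is installed, uses a hypothetical synthetic data frame `MyData` with the columns `Sqft` and `Price`, tunes only degrees 1 to 4, and uses four instead of ten folds to keep the run time short.

```r
library(tidymodels)

# Hypothetical synthetic data standing in for MyData
set.seed(987)
MyData=tibble(Sqft=runif(400, 500, 4000))
MyData$Price=50000+150*MyData$Sqft+0.02*MyData$Sqft^2+rnorm(400, sd=40000)

# Step 1: training and testing data
Split80=MyData |>
  initial_split(prop=0.8, strata=Price, breaks=5)
DataTrain=training(Split80)
DataTest=testing(Split80)

# Step 2: recipe with a tunable polynomial degree
Recipe=recipe(Price~Sqft, data=DataTrain) |>
  step_poly(Sqft, degree=tune())

# Step 3: model design
ModelDesign=linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Step 4: workflow
TuneWFModel=workflow() |>
  add_recipe(Recipe) |>
  add_model(ModelDesign)

# Step 5: hyper-parameter grid (column named like the hyper-parameter)
ParGrid=data.frame(degree=1:4)

# Step 6: resamples for Cross-Validation
FoldsForTuning=vfold_cv(DataTrain, v=4, strata=Price)

# Step 7: tune the workflow (4 folds x 4 degrees = 16 fits)
TuneResults=tune_grid(TuneWFModel, resamples=FoldsForTuning,
                      grid=ParGrid,
                      metrics=metric_set(rmse, rsq, mae))

# Step 8: extract the best hyper-parameter
BestHyperPar=select_best(TuneResults, metric="rmse")

# Step 9: finalize and train the best workflow model
BestWFModel=TuneWFModel |>
  finalize_workflow(BestHyperPar) |>
  fit(DataTrain)

# Step 10: assess prediction quality based on the testing data
DataTestWithPredBestModel=augment(BestWFModel, DataTest)
TestMetrics=metrics(DataTestWithPredBestModel, truth=Price, estimate=.pred)
print(BestHyperPar)
print(TestMetrics)
```

To adapt the sketch to another model, only Steps 2, 3, and 5 need to change, which is the point of the unified `tidymodels` framework.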
6.7 🧭Project: Tuning a k-Nearest Neighbors Model
Interactive Section
In this section, you will find content together with R code to execute, change, and rerun in RStudio.
The best way to read and to work with this section is to open it with RStudio. Then you can interactively work on R code exercises and R projects within a web browser. This way you can apply what you have learned so far and extend your knowledge. You can also choose to continue reading either in the book or online, but you will not benefit from the interactive learning experience.
To work with this section in RStudio in an interactive environment, follow these steps:
- Ensure that both the learnr and the shiny package are installed. If not, install them from RStudio’s main menu (Tools -> Install Packages \(\dots\)).
- Download the Rmd file for the interactive session and save it in your project folder. You will find the link for the download below.
- Open the downloaded file in RStudio and click the Run Document button, located in the editing window’s top-middle area.
For detailed help for running the exercises including videos for Windows and Mac users we refer to: https://blog.lange-analytics.com/2024/01/interactsessions.html
Do not skip this interactive section because besides providing applications of already covered concepts, it will also extend what you have learned so far.
Below is the link to download the interactive section:
https://ai.lange-analytics.com/exc/?file=06-TrainTestExerc100.Rmd
In Section 4.9, you used a k-Nearest-Neighbor model to predict the color of a wine. We arbitrarily set \(k=4\) to consider the four nearest neighbors.
In this section, you will work on the same problem with an interactive project, but you will tune the hyper-parameter \(k\) with Cross-Validation to find an optimal \(k\) (good approximation of the training data without overfitting). You will use the 10-Step Template from Section 6.6 to make it easy to set up the code.
In the Digital Resources section for this chapter (see Section 6.10), you will find a link to a blog post that describes how to use the 10-Step Template in detail. The blog post also provides the R code for the template.
- Step 1 - Generating Training and Testing Data:
  As before, you use the wine dataset and split the data into training (DataTrain) and testing data (DataTest). Since you use the same value in the set.seed() command, the (random) split will be identical to the one we used before with the \(k=4\) Nearest Neighbor model. The code below has been executed already.
    library(tidymodels); library(rio); library(janitor)
    DataWine=import("https://ai.lange-analytics.com/data/WineData.rds") |>
      clean_names("upper_camel") |>
      rename(Sulfur=TotalSulfurDioxide) |>
      mutate(WineColor=as.factor(WineColor))
    set.seed(876)
    Split7030=initial_split(DataWine, prop=0.7, strata=WineColor)
    DataTrain=training(Split7030)
    DataTest=testing(Split7030)
    head(DataTrain)
    ##   WineColor Acidity VolatileAcidity CitricAcid
    ## 1       red    10.8           0.320       0.44
    ## 2       red     6.7           0.855       0.02
    ## 3       red     7.5           0.380       0.57
    ## 4       red     7.1           0.270       0.60
    ## 5       red     8.0           0.580       0.28
    ## 6       red     7.6           0.400       0.29
    ##   ResidualSugar Chlorides FreeSulfurDioxide Sulfur
    ## 1           1.6     0.063                16     37
    ## 2           1.9     0.064                29     38
    ## 3           2.3     0.106                 5     12
    ## 4           2.1     0.074                17     25
    ## 5           3.2     0.066                21    114
    ## 6           1.9     0.078                29     66
    ##   Density   PH Sulphates Alcohol Quality
    ## 1  0.9985 3.22      0.78   10.00       6
    ## 2  0.9947 3.30      0.56   10.75       6
    ## 3  0.9960 3.36      0.55   11.40       6
    ## 4  0.9981 3.38      0.72   10.60       6
    ## 5  0.9973 3.22      0.54    9.40       6
    ## 6  0.9971 3.45      0.59    9.50       6
- Step 2 - Create a Recipe:
  Here, you will create a recipe and store it in the R object Recipe. Use step_rm() to remove the predictor variable \(Quality\) because it is not related to \(WineColor\), and the command step_normalize() to normalize all remaining predictors. Please substitute <THESE> placeholders accordingly and execute the code. Note that the data frame DataTrain has already been loaded in the background:
- Step 3 - Create a Model Design:
  Next, you will create the model design and store it in the R object ModelDesign. Since you plan to tune the argument neighbors= (stands for the \(k\) in Nearest Neighbors), you have to add it as an argument into the nearest_neighbor() command by substituting <ARGUMENT(S) OF COMMAND> with the argument and its value. Remember that you cannot set neighbors to a specific numerical value because you want to tune the hyper-parameter neighbors later on. Therefore, you have to assign the placeholder tune() to the argument neighbors.
- Step 4 - Add the Recipe and the Model Design to a Workflow:
  The code block below adds the R object Recipe and the model design object ModelDesign to a workflow model named TuneWFModel. The R code has been executed already.
    ## ══ Workflow ═════════════════════════
    ## Preprocessor: Recipe
    ## Model: nearest_neighbor()
    ##
    ## ── Preprocessor ─────────────────────
    ## 2 Recipe Steps
    ##
    ## • step_rm()
    ## • step_normalize()
    ##
    ## ── Model ────────────────────────────
    ## K-Nearest Neighbor Model Specification (classification)
    ##
    ## Main Arguments:
    ##   neighbors = tune()
    ##
    ## Computational engine: kknn
  You can see in the printout above that the workflow is not finalized because the number of neighbors has not been set in the model design. The hyper-parameter neighbors is set to tune() instead and will later, in Step 7, be replaced in a trial and error process with different values for neighbors.
- Step 5 - Create a Hyper-Parameter Grid:
  Later, when tuning is executed in Step 7, values ranging from 1 – 15 for the hyper-parameter neighbors shall be tried out.
  You need to provide these values in a data frame column that is named the same as the hyper-parameter. Below, replace the <LIST OF VALUES> to define a column neighbors in the data frame ParGrid. The column should contain values from 1 – 15:
- Step 6 - Creating Resamples for Cross-Validation:
  The values you have created above for \(k\) (hyper-parameter neighbors) will be evaluated later using five folds (resamples). Each fold contains the complete training data, but different sections are used for training and assessment in each fold.
  Please create five folds below by substituting <NUMBER OF FOLDS> accordingly.
  The folds will be saved in the R object FoldsForTuning:
- Step 7 - Tune the Workflow and Train All Models:
  Now it is time to run the tuning procedure using the tune_grid() command. Be patient because it will take some time to fully execute. Since we have to try out 15 parameter values and use five folds for each of them, the tune_grid() command has to fit 75 models \((15\cdot 5=75)\).
  Please substitute <LIST OF METRICS> with a list of metrics to be calculated. Use the metrics accuracy, specificity, and sensitivity.
  After the tuning is finished, all results are stored in the R object TuneResults, and they can be evaluated with different metrics commands.
  For example, the command autoplot() provides a diagrammatic overview of the results for all three metrics.
    TuneResults=tune_grid(TuneWFModel, resamples=FoldsForTuning,
                          grid=ParGrid,
                          metrics=metric_set(<LIST OF METRICS>))
    autoplot(TuneResults)
  The three graphs that you will create in the exercise above are also displayed in Figure 6.7. They show accuracy, specificity, and sensitivity for all tried-out values of the hyper-parameter neighbors. The results for the five folds are averaged.
FIGURE 6.7: Tuning Results for Various k Values
  You can see that neighbors values between 1 and 4 produce the best results for accuracy (predicting red and white wines (positive and negative class) correctly) and sensitivity (predicting red wines (positive class) correctly). If you look at specificity (predicting white wines (negative class) correctly), you can see that \(k=5\) produces the best result. However, \(k=5\) also leads to a sharp decrease in sensitivity and a decrease in accuracy as well.
  It seems reasonable to use the best result for accuracy, which means choosing a \(k\) between 1 and 4.
- Step 8 - Extract the Best Hyper-Parameter(s):
  All assessment results for the specified metrics are stored in the tuning object TuneResults. Choose the metric accuracy by substituting <METRIC> accordingly. Afterward, when you execute the code, the select_best() command extracts the best hyper-parameter (value for neighbors) for the metric you specified:
  As you will see in the printout after executing the R code, the best value for the hyper-parameter neighbors based on accuracy is 1. It is saved in the data frame BestHyperPar.
  You can try other metrics in the code block above and see if the result changes. Why would the metric rmse cause an error?
- Step 9 - Finalize and Train the Best Workflow Model:
  The code block below is executed already. The finalize_workflow() command used the value from BestHyperPar to substitute the tune() placeholder in the R object TuneWFModel (created in Steps 2 – 4). The hyper-parameter is now set to neighbors=1, completing the workflow. The command fit(DataTrain) calibrates the workflow to the training data, and the result is saved into WFModelBest.
    ## ══ Workflow [trained] ═══════════════
    ## Preprocessor: Recipe
    ## Model: nearest_neighbor()
    ##
    ## ── Preprocessor ─────────────────────
    ## 2 Recipe Steps
    ##
    ## • step_rm()
    ## • step_normalize()
    ##
    ## ── Model ────────────────────────────
    ##
    ## Call:
    ## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(1L,
    ##
    ## Type of response variable: nominal
    ## Minimal misclassification: 0.009383
    ## Best kernel: optimal
    ## Best k: 1
  At the end of the printout above, you can see that \(k\) was set to 1 rather than tune(). You can also see that WFModelBest is fitted to the training data because the “Minimal misclassification” is reported (based on the training data).
- Step 10 - Assess Prediction Quality Based on the Testing Data:
  Since WFModelBest is a fitted model, you can use it for predictions. In this last step, you use the augment() command to predict WineColor. The augment() command will then add the predicted class as column .pred_class to the testing data.
  The conf_mat() command compares the predictions in column .pred_class to the true values to create a confusion matrix. But first, you have to substitute <OUTCOME VARIABLE> with the variable name (column name) for the outcome variable.
    DataTestWithPredBestModel=augment(WFModelBest, DataTest)
    conf_mat(DataTestWithPredBestModel, truth=<OUTCOME VARIABLE>, estimate=.pred_class)
  You will see in the confusion matrix that, based on the testing data, only six of the 480 red wines were misclassified. Likewise, only seven of the 480 white wines were misclassified. Given that the classification was only based on the chemical properties of the wines, the results are excellent.
Most likely, a true wine expert might have reached a similarly impressive result, but it would have taken them a long time to classify 960 wines.
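The counts reported above can be turned into the three metrics used during tuning. The base R sketch below rebuilds the confusion matrix from the numbers in the text (480 red wines with six misclassified, 480 white wines with seven misclassified) and treats red as the positive class. The object names are assumptions for illustration.

```r
# Confusion matrix rebuilt from the counts in the text
# (rows: predicted class, columns: true class)
ConfMat=matrix(c(474, 6, 7, 473), nrow=2,
               dimnames=list(Prediction=c("red", "white"),
                             Truth=c("red", "white")))
ConfMat

# Share of all wines classified correctly
Accuracy=sum(diag(ConfMat))/sum(ConfMat)

# Share of red wines (positive class) classified correctly
Sensitivity=ConfMat["red", "red"]/sum(ConfMat[, "red"])

# Share of white wines (negative class) classified correctly
Specificity=ConfMat["white", "white"]/sum(ConfMat[, "white"])

round(c(accuracy=Accuracy, sensitivity=Sensitivity,
        specificity=Specificity), 4)
```

With 947 of 960 wines classified correctly, accuracy is about 0.9865, sensitivity is 0.9875, and specificity is about 0.9854.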
6.8 When and When not to Use Polynomial Regression
Polynomial Regression is a straightforward but not particularly sophisticated machine learning procedure. Therefore, it should only be used for basic non-linear relationships where possible interactions between predictor variables are known.
Regular OLS models and basic Polynomial Models have the advantage that the coefficients are directly interpretable. This is no longer true, even for slightly more complex Polynomial Models.
Since the advantage of direct coefficient interpretability is lost for more complex Polynomial Models, it is recommended to use more powerful machine learning models such as Neural Networks (see Chapter 9) or tree-based models like Random Forest (see Chapter 10) when analyzing complex regression problems.
6.9 When and When not to Use Tuning
Anytime a machine learning model has hyper-parameters that you believe have an impact on the predictive quality, you should use tuning.
Even if you have a small dataset, you can use tuning. Although the tuning procedures described here are not well suited for small datasets, you can use procedures like Bootstrapping or Leave-One-Out (see Kuhn and Silge (2022) for more details) for smaller datasets.
Deciding which hyper-parameters to tune and how many values to try can be challenging. This is especially true when you tune more than one hyper-parameter. In that case, you have to try different combinations of the values for each hyper-parameter, and consequently, the number of models to tune can get very big very fast. The number of models the tuning has to fit equals the number of folds times the number of hyper-parameter value combinations. For example, if you have ten folds and three hyper-parameters with five values each, and you want to try out all combinations of these hyper-parameter values, you have to fit 1,250 model/data combinations \((10\cdot5\cdot5\cdot5=1,250)\).
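The count from the example above can be verified with a small base R sketch. The hyper-parameter names and values are hypothetical placeholders. `expand.grid()` creates one row for every combination of the listed values.

```r
NumFolds=10

# Three hypothetical hyper-parameters with five values each
ParGridDemo=expand.grid(HyperPar1=1:5,
                        HyperPar2=1:5,
                        HyperPar3=1:5)

NumCombinations=nrow(ParGridDemo)    # 5 * 5 * 5 combinations
NumFits=NumFolds*NumCombinations     # fits needed when all are tried out

c(combinations=NumCombinations, fits=NumFits)
```

Adding a fourth hyper-parameter with five values would multiply the number of fits by five again, which is why the grid should be chosen with care.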
6.10 Digital Resources
Below you will find a few digital resources related to this chapter such as:
- Videos
- Short articles
- Tutorials
- R scripts
These resources are recommended if you would like to review the chapter from a different angle or to go beyond what was covered in the chapter.
Here we show only a few of the digital resources. At the end of the list you will find a link to additional digital resources for this chapter that are maintained on the Internet.
You can find a complete list of digital resources for all book chapters on the companion website: https://ai.lange-analytics.com/digitalresources.html
Polynomial Regression Video
Mike X. Cohen provides a YouTube video that explains the basic idea of Polynomial Regression.

The Danger of Overfitting
This video by Cassie Kozyrkov, former Chief Decision Scientist at Google, explains why splitting data into training and testing data is important. She also explains why overfitting is a problem.

Supported Recipe Steps for Preprocessing
Here is a list of all recipe step_() commands that can be piped with |> to a recipe. The linked website will tell you which steps are available for which preprocessing purpose.

Supported Machine Learning Models from tidymodels
Here is a list of all supported tidymodels machine learning models. The linked website will tell you for each model:
- the model name
- the package name(s) for the set_engine() command
- the hyper-parameters that you can tune

A 10-Step Template to Create, Tune, and Assess a Machine Learning Model with tidymodels
The link below will open a blog article by Carsten Lange. The article provides a tidymodels 10-step template for creating, tuning, and assessing machine learning models. The template is explained in detail and a link for downloading the related R script is provided.

More Digital Resources
Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

References
This is at least true for supervised models. Unsupervised models like reinforcement models that can improve during the production stage exceed the scope of this book. In an IBM blog article Delua (2021) provides a brief comparison between supervised and un-supervised machine learning models.↩︎
A multivariate real estate model will be covered in Chapter 7.↩︎
Later in this chapter, Polynomial Models with various degrees will be used.↩︎
For more details about Orthogonal Polynomial Regression see Narula (1979).↩︎
In what follows, we show how to create a validation dataset and how to perform Cross-Validation. For other procedures such as Bootstrapping or Leave-One-Out we refer to Kuhn and Silge (2022).↩︎
If you prefer to use the validation dataset that we developed at the beginning of this section you can change the resamples= argument to resamples=DataValidate.↩︎
At the writing of this book, tidymodels supported more than 30 machine learning models (see https://parsnip.tidymodels.org/reference).↩︎