Chapter 7 Ridge, Lasso, and Elastic-Net — Regularization Explained

This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.

If you find an error, have an idea for improvement, or have a question, please visit the forum for the book. You find additional resources for the book on its companion website, and you can ask the experimental AI assistant about topics covered in the book.

If you enjoy reading the online version of Practical Machine Learning with R, please consider ordering a paper, PDF, or Kindle version to support the development of the book and help me maintain this free online version.

In the previous chapter, you learned how to adjust hyper-parameters during the model design stage to avoid overfitting while still getting a good approximation of the training data.

In this chapter, we will introduce regularization. Regularization is another technique to avoid overfitting, and it is applied at the training stage of a machine learning algorithm — when the optimal $\beta$ parameters are determined. Regularization adds penalties for large parameters to a machine learning model’s target function. Here we will cover the most common penalty types Lasso, Ridge, and Elastic-Net. The latter is a combination between Lasso and Ridge. These penalties limit the flexibility of the underlying regression model during the optimization phase by eliminating or significantly weakening model-parameters ($\beta s$) that are not crucial for the prediction quality. We will explain the idea behind regularization in Section 7.4.

In Section 7.4.1, a Lasso regularized univariate Polynomial Regression model will be introduced, and in Section 7.4.2, a Ridge regularized Polynomial Regression model will be presented. Section 7.4.3 shows how Ridge and Lasso models can be combined into the Elastic-Net approach.

In the interactive Section 7.5, you will use a multivariate Elastic-Net regression model to estimate house prices. You can experiment with the hyper-parameters that control the mix of Ridge and Lasso in an Elastic-Net regression model. Afterward, you will tune the Elastic-Net model to optimize predictive quality.

7.1 Learning Outcomes

This section outlines what you can expect to learn in this chapter. In addition, the corresponding section number is included for each learning outcome to help you to navigate the content, especially when you return to the chapter for review.

In this chapter, you will learn:

The basic idea behind regularization (see Section 7.4)
The difference between the penalty terms for Lasso and Ridge regression models (see Section 7.4)
How the target function for Lasso regularized regression models differs from the $MSE$ function of an unregularized model (see Section 7.4.1)
How to create a workflow for a Lasso regularized regression using the R tidymodels framework (see Section 7.4.1)
How Lasso regularized parameter estimates are affected by the value of the Lasso penalty hyper-parameter (see Section 7.4.1)
How the target function for Ridge regularized regression model differs from the $MSE$ function of an unregularized model (see Section 7.4.2)
How to create a workflow for a Ridge regularized model using the R tidymodels framework (see Section 7.4.2)
How Ridge regularized parameter estimates are affected by the value of the Ridge penalty hyper-parameter (see Section 7.4.2)
How Elastic-Net regularization combines elements of both Lasso and Ridge to create a more flexible regularization function (see Section 7.5.3).
How to create a workflow for a Elastic-Net regularized model using the R tidymodels framework (see Section 7.5.3)
How to tune Elastic-Net hyper-parameters and how to measure the final predictive performance (see Section 7.5.4)

7.2 R Packages Required for the Chapter

This section lists the R packages that you need when you load and execute code in the interactive sections in RStudio. Please install the following packages using Tools -> Install Packages $\dots$ from the RStudio menu bar (you can find more information about installing and loading packages in Section 3.4):

The rio package (Chan et al. (2021)) to enable the loading of various data formats with one import() command. Files can be loaded from the user’s hard drive or the Internet.
The janitor package (Firke (2023)) to rename variable names to UpperCamel and to substitute spaces and special characters in variable names.
The tidymodels package (Kuhn and Wickham (2020)) to streamline data engineering and machine learning tasks.
The kableExtra (Zhu (2021)) package to support the rendering of tables.
The learnr package (Aden-Buie, Schloerke, and Allaire (2022)), which is needed together with the shiny package (Chang et al. (2022)) for the interactive exercises in this book.
The shiny package (Chang et al. (2022)), which is needed together with the learnr package (Aden-Buie, Schloerke, and Allaire (2022)) for the interactive exercises in this book.
The glmnet package (Friedman, Tibshirani, and Hastie (2010), Tay, Narasimhan, and Hastie (2023)), which is needed to execute Lasso, Ridge, and Elastic-Net regression models.

7.3 Unregularized Benchmark Model

To demonstrate the idea behind regularization, we start by comparing an unregularized model that we introduce in this section to three different regularized models in Section 7.4.

As in Chapter 6, our goal is to predict house prices based on the King County House Sale dataset (Kaggle (2015)). The code block below imports the dataset and selects the predictor variables. Afterward, the observations are split into training and testing datasets. Both datasets will be used for the unregularized model in this section as well as for the three regularized models in Section 7.4.

library(tidymodels); library(rio); library(janitor)
DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv") |>
  clean_names("upper_camel") |>
  select(Price, Sqft=SqftLiving)

set.seed(777)
Split001=initial_split(DataHousing, prop=0.001, strata=Price, breaks=5)
DataTrain=training(Split001)
DataTest=testing(Split001)

Note that we set prop=0.001, resulting in a training dataset with only 20 observations. We use such a small training dataset purposely to create a scenario where overfitting becomes problematic, allowing us to show how regularization can mitigate an overfitting problem later.

As in Chapter 6, we use a Degree-5 Polynomial Regression model:

\[\begin{equation} \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6 \tag{7.1} \end{equation}\]

For an unregularized model, like the one here, the model-parameters are determined by the Optimizer with the goal to minimize the error function — the Mean Squared Error ($MSE$):

\[\begin{eqnarray} MSE&=&\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2 \tag{7.2}\\ \mbox{with:}&& \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6 \nonumber \end{eqnarray}\]

When you substitute the estimated price $(\widehat{Price}_i)$ for a house observation $i$ in the error function (7.2) with the prediction function (also shown in equation (7.2)), you can see that the $MSE$ only depends on the model-parameters (the $\beta s$). This is because the training data already determines all other values ($Price_i$ and $Sqft_i$).

Consequently, the Optimizer can reach the goal of minimizing the $MSE$ by finding the optimal model parameters $(\beta_{j, opt.})$ with a systematic trial-and-error approach.

The code block further below shows that there is little difference between WFModelBenchmark and the workflow we created in Section 6.4 for a Polynomial Regression model.

Only the recipe is different because we use step_normalize(all_predictors()) to Z-score normalize all predictors (see Section 4.6 for details about normalizing predictor variables).

Normalization is usually not required for a Polynomial Regression model. Still, here we normalize the predictors because we will use this workflow model as a benchmark to compare with three regularized regression models — Lasso (see Section 7.4.1), Ridge (see Section 7.4.2), and Elastic-Net (see 7.4.3). Since all three models require normalization, the benchmark model must also use normalized predictors to be comparable.

To find out why regularized models need normalization, go to the Digital Resources section for this chapter (see Section 7.7).

library(tidymodels)
ModelDesignBenchmark=linear_reg() |>
                     set_engine("lm") |> 
                     set_mode("regression")

RecipeHouses=recipe(Price~., data=DataTrain) |> 
             step_mutate(Sqft2=Sqft^2,Sqft3=Sqft^3,
                         Sqft4=Sqft^4,Sqft5=Sqft^5) |> 
             step_normalize(all_predictors())

WFModelBenchmark=workflow() |> 
                 add_model(ModelDesignBenchmark) |> 
                 add_recipe(RecipeHouses) |> 
                 fit(DataTrain)

Since the code above creates a fitted workflow model for the Degree-5 Polynomial model, we can extract the $\beta$-parameters from the fitted workflow model for the house price analysis by utilizing the tidy() command:

tidy(WFModelBenchmark)

## # A tibble: 6 × 5
##   term           estimate  std.error statistic  p.value
##   <chr>             <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)     509945.     36463.    14.0    1.28e-9
## 2 Sqft           8853783.  10515448.     0.842  4.14e-1
## 3 Sqft2        -50947114.  54352075.    -0.937  3.64e-1
## 4 Sqft3        112589222. 111217647.     1.01   3.29e-1
## 5 Sqft4       -106894260. 101985738.    -1.05   3.12e-1
## 6 Sqft5         36592435.  34688741.     1.05   3.09e-1

We will come back to these values later on. For now, keep in mind that the individual $\beta$ parameters (see the estimate column) are not interpretable. This is because a univariate Degree-5 Polynomial model is already too complex to allow for the interpretation of its $\beta$ parameters. However, try to remember that with the exception of the intercept ($\beta_0$) most $\beta$ values are in the one to three-digit million range. You will later see how regularization can lower some of these values or even set them to zero.

Now, let us look at the predictive quality of the unregularized benchmark model (WFModelBenchmark). We start with evaluating the training data with the R code shown in the code block below. The augment() command creates the predictions and augments them to the training data. Afterward, the metrics() command calculates the root mean squared error (rmse), $r^2$ (rsq), and the mean absolute error (mae) from the data frame DataTrainWithPredBenchmark which includes both the predicted and the true price for the training observations:

DataTrainWithPredBenchmark=augment(WFModelBenchmark, DataTrain)
metrics(DataTrainWithPredBenchmark, truth=Price, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   136432.   
## 2 rsq     standard        0.715
## 3 mae     standard   104047.

To evaluate the predictive quality based on the testing data, we use the same code applied to the testing data:

DataTestWithPredBenchmark=augment(WFModelBenchmark, DataTest)
metrics(DataTestWithPredBenchmark, truth=Price, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   99940240.    
## 2 rsq     standard          0.0215
## 3 mae     standard    1719470.

The average over/underestimation (mae) based on the training data is about $104,000 while the same metric based on the testing data is about $1,719,500. This is a strong indication of an overfitting problem.

7.4 The Idea Behind Regularization

In Chapter 6, we tuned hyper-parameters to avoid overfitting. In this chapter, we will tackle the problem with regularization.

Regularization

Regularization is a technique applied during a model’s calibration. The goal is to generate optimal model-parameters $(\beta s)$ that are smaller than the ones from the related unregularized model — possibly zero.

Small model-parameters weaken the influence of the associated predictor variables or eliminate their influence if the parameter is zero.

A model with fewer variables or seriously weakened influence of some variables is less flexible, and therefore overfitting becomes less likely.

In essence, regularization minimizes or eliminates the influence of predictor variables with little explanatory power on the output variable, thereby improving predictive performance.

The goal of regularization — to generate smaller or zero $\beta$ parameters — leads to the following question:

How can we influence the Optimizer to produce smaller model-parameters than the ones that minimize the Mean Squared Error?

The answer is that we must give the Optimizer a different goal. Such a goal is formalized in the target function below:

\[\begin{eqnarray} T^{arget}&=&\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2+\lambda P^{enalty} \tag{7.3}\\ \mbox{with:}&& \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6 \nonumber \end{eqnarray}\]

You can see that now the target function consists of the $MSE$ and a penalty term $(P^{enalty})$. Hence, we call it an target function rather than an error function. The goal for the Optimizer is now twofold:

Minimizing the $MSE$.
Minimizing a penalty value which is high when the model-parameters (the $\beta s$) are large and low otherwise.

The $MSE$ and the $P^{enalty}$ still only depend on the values for the $\beta s$. Therefore, the Optimizer can minimize the target function (7.3) as before with systematic trial-and-error.

The hyper-parameter $\lambda$ ($0 \le \lambda < +\infty$) determines the strength of the $P^{enalty}$ relative to the $MSE$.

Two major approaches exist to quantify the penalty:

Lasso:

The penalty is calculated as the sum of all (absolute) $\beta$ values, except the one for the intercept $(\beta_6)$:

\[\begin{equation} P_{Lasso}^{enalty}=\sum_{j=1}^{5} \rvert \beta_j \rvert \tag{7.4} \end{equation}\]

Note that reducing a large or a small $\beta$ parameter by the same amount has the same impact on the penalty.

Ridge:

The penalty is calculated as the sum of all squared $\beta$ values, except the one for the intercept $(\beta_6)$:

\[\begin{equation} P_{Ridge}^{enalty}=\sum_{j=1}^{5} \beta_j^2 \tag{7.5} \end{equation}\]

Note that because of the squaring of the $\beta s$, large parameters have an over-proportional impact on the penalty. Thus, reducing a large $\beta$ parameter by one unit has a bigger impact on the penalty than reducing a smaller $\beta$ parameter by one unit. We will see later that this is an important difference between Lasso and Ridge.

7.4.1 Lasso Regularization

When you substitute the definition for the Lasso penalty from equation (7.4) into the target function (7.3) you get the target function for the Lasso approach:

\[\begin{eqnarray} T^{arget}_{Lasso}&=&\underbrace{\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2}_{MSE_{OLS}} + \lambda\underbrace{\sum_{i=j}^{5} \rvert \beta_j \rvert}_{P^{enalty}_{Lasso}} \tag{7.6}\\ \mbox{with:}&& \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6 \nonumber \end{eqnarray}\]

You can see again that the target $(T^{arget}_{Lasso})$ only depends on the $\beta$ values because the values for the prices of the houses $(Price_i)$ and their square footage $(Sqft_i)$ are known from the training data. Consequently, the Optimizer can again use a systematic trial-and-error process to find the $\beta$ values that minimize the target function (7.6).

During the process, the Optimizer considers both the $MSE$ and the $P^{enalty}_{Lasso}$ simultaneously. Therefore for all $\lambda>0$ the optimal $\beta$ parameters will be smaller than in the case of an unregularized optimization $(\lambda=0)$.

Below you will find the R code to create a Lasso model design and a fitted workflow model. For the recipe that created the various powers of the $Sqft$ and normalized the resulting predictors, we used the recipe from Section 7.3:

library(glmnet)
set.seed(777)
ModelDesignLasso=linear_reg(penalty=500, mixture=1) |>
                 set_engine("glmnet") |> 
                 set_mode("regression")

WFModelLasso=workflow() |> 
             add_model(ModelDesignLasso) |> 
             add_recipe(RecipeHouses) |> 
             fit(DataTrain)

Note, for the model design we again used the linear_reg() command, but in contrast to Section 7.3, where we used the lm package to create the unregularized model, we now use the glmnet package (see set_engine("glmnet")) to create a regularized model.

The glmnet package is a highly efficient package developed by Friedman, Tibshirani, and Hastie (2010). It supports multiple machine learning algorithms including Lasso (with linear_reg(mixture=1)) and Ridge with (with linear_reg(mixture=0)). We will discuss the hyper-parameter mixture in more detail in Section 7.4.3. In the code block above, we set mixture=1 to work with a Lasso model.

The other hyper-parameter penalty= in the code block above stands for the $\lambda$ in Equation (7.6). This tidyverse hyper-parameter name is a little confusing because it determines the strength of the penalty but not the penalty itself. Just remember, penalty and $\lambda$ are essentially the same.

The hyper-parameter named penalty can either be set to a specific numerical value or be determined in a tuning process. In the latter case, tuning and regularization are combined. In the interactive project in Section 7.4.3 you will tune the penalty (the $). In the code block above, we arbitrarily chose penalty=500 to keep things simple.

Since the workflow WFModelLasso is a fitted workflow model (it was calibrated with the training data), we can extract the model parameters with the tidy() command:

tidy(WFModelLasso)

## # A tibble: 6 × 3
##   term        estimate penalty
##   <chr>          <dbl>   <dbl>
## 1 (Intercept)  509945.     500
## 2 Sqft        -460508.     500
## 3 Sqft2       1171967.     500
## 4 Sqft3             0      500
## 5 Sqft4             0      500
## 6 Sqft5       -560318.     500

Notice that after including the Lasso penalty, the $\beta$ parameters (see column estimate) are smaller than those from the unregularized benchmark model in Section 7.3.⁴³ Two model-parameters ($\beta_3$ and $\beta_4$) are equal to zero, essentialy eliminating the related predictor variables $Sqft^3$ and $Sqft^4$ from the prediction equation. This is more obvious when you plug in the Lasso $\beta$ values into Equation (7.1):

\[\begin{eqnarray} \widehat{Price}_i&=&-460508 \cdot Sqft_i+ 1171967 \cdot Sqft_i^2+ 0 \cdot Sqft_i^3+\nonumber \\ &&0 \cdot Sqft_i^4+ (-560318) \cdot Sqft_i^5+ 509945 \nonumber \\ \widehat{Price}_i&=&-460508 \cdot Sqft_i +1171967 \cdot Sqft_i^2 +(-560318) \cdot Sqft_i^5 + 509945 \nonumber \tag{7.7} \end{eqnarray}\]

The equation above is simpler and has smaller $\beta$ parameters than the related prediction equation for an unregularized model which explains why overfitting is less likely.

To compare the predictive quality of the Lasso model to the benchmark model from Section 7.3, we first calculate the metrics based on the training for the Lasso model:

DataTrainWithPredLasso=augment(WFModelLasso,DataTrain)
metrics(DataTrainWithPredLasso, truth=Price, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   144976.   
## 2 rsq     standard        0.679
## 3 mae     standard   110007.

Then, we calculate the metrics for the Lasso model based on the testing data:

DataTestWithPredLasso=augment(WFModelLasso, DataTest)
metrics(DataTestWithPredLasso, truth=Price, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator    .estimate
##   <chr>   <chr>             <dbl>
## 1 rmse    standard   4723086.    
## 2 rsq     standard         0.0296
## 3 mae     standard    303118.

Let us look at the average over/underestimation of the house prices (mae) and compare the Lasso model to the unregularized benchmark model from Section 7.3. You can see that based on the training data the mae increased from about $104,000 to $110,000 but more importantly based on the testing data the mae decreased from about $1,719,500 to $303,100. These results suggest better predictive performance of the Lasso model and, more importantly, indicate that the overfitting problem is most likely mitigated.

This figure shows the value for each model parameter at the vertical axis and the Lasso Penalty at the horizontal axis. You can see how the parameters become smaller when the penalty increases.

FIGURE 7.1: Parameter Estimates for Different Lasso Penalty Estimates

So far, we have only developed $\beta$ estimates for a penalty of $\lambda=500$. Figure 7.1 shows the $\beta$ estimates for the related predictor variables based on penalty values ranging from $\lambda=19$ to $\lambda=190900$.⁴⁴

As expected, when the penalty ($\lambda$; corresponds to hyper-parameter penalty in tidymodels) increases, the sum of the calibrated absolute $\beta$ values decreases. You can also see that all $\beta s$ decrease eventually to zero when the $\lambda$ increases enough. However, before this happens to all $\beta$ values, some $\beta s$ decrease to zero already at a lower $\lambda$ level, successively eliminating the related predictor variable from the analysis.

Let us explain this phenomenon with an example. When $\lambda=500$, the parameters for the predictor variables $Sqft^3$ and $Sqft^4$ are already zero ($\beta_3=0$ and $\beta_4=0$), the $\beta$ values for $\beta_1$, $\beta_2$, and $\beta_5$ are -460508, 1171967 and -560318, respectively. You can confirm this in Figure 7.1 when you imagine a vertical line at $\lambda=500$.

When you move rightwards in the diagram in Figure 7.1, $\lambda$ increases, and with it, the influence of $P^{enalty}_{Lasso}$ on $T^{arget}_{Lasso}$ in the target function (7.6). To compensate the Optimizer needs to lower one or more $\beta$ values. Since it does not matter for $P^{enalty}_{Lasso}$ which parameter will be lowered, the Optimizer will successively lower those $\beta$ parameters that create the smallest damage to the $MSE$ when lowered.

7.4.2 Ridge Regularization

In this section, we will work with Ridge regularization. As you will see, this approach differs only slightly from the Lasso approach we introduced in Section 7.4.1.

The most significant difference is that Ridge uses a different method to penalize large $\beta$ values. Instead of adding up the absolute values to calculate the penalty, Ridge squares the $\beta$ values before adding them up to calculate the Ridge penalty $(P^{enalty}_{Ridge})$.

This becomes more obvious when you substitute the definition for the Ridge penalty from equation (7.5) into the target function (7.3) resulting in the target function for the Ridge approach:

\[\begin{eqnarray} T^{arget}_{Ridge}&=&\underbrace{\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2}_{MSE} + \lambda\underbrace{\sum_{i=j}^{5} \beta_j^2 }_{P^{enalty}_{Ridge}} \tag{7.8}\\ \mbox{with:}&& \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6 \nonumber \end{eqnarray}\]

Since the $\beta$ parameters are now squared when contributing to the Ridge penalty, larger $\beta$ values have an over-proportional impact on the penalty, a major difference compared to the Lasso approach.

When using the Ridge model with tidymodels, we can use the same code as in the Lasso Section 7.4.1. The only two changes are that we set the argument mixture= in the linear_reg() command to mixture=0, indicating that we use a Ridge model⁴⁵ and that we arbitrarily set the argument penalty=1000000 $(\lambda=1000000)$.

set.seed(777)
ModelDesignRidge=linear_reg(penalty=1000000, mixture=0) |>
                 set_engine("glmnet") |> 
                 set_mode("regression")

WFModelRidge=workflow() |> 
             add_model(ModelDesignRidge) |> 
             add_recipe(RecipeHouses) |> 
             fit(DataTrain)

For the recipe that creates the various powers of the $Sqft$ and normalizes the resulting predictors, we again use the recipe from Section 7.3.

Since the workflow WFModelRidge is a fitted workflow model (it was calibrated with the training data), we can extract the model parameters with the tidy() command:

tidy(WFModelRidge)

## # A tibble: 6 × 3
##   term        estimate penalty
##   <chr>          <dbl>   <dbl>
## 1 (Intercept)  509945. 1000000
## 2 Sqft          25790. 1000000
## 3 Sqft2         23133. 1000000
## 4 Sqft3         19885. 1000000
## 5 Sqft4         16968. 1000000
## 6 Sqft5         14570. 1000000

The $\beta$ parameters (see column estimate) are smaller than the ones from the unregularized benchmark model in Section 7.3. However, none of the $\beta$ parameters is equal to zero. So, none of the predictor variables is eliminated. This is a major difference between Ridge and Lasso.

To evaluate the Ridge model’s predictive quality and to compare the performance with the benchmark model from Section 7.3 and the Lasso model from Section 7.4.1 we calculate the Ridge model’s metrics based on the testing data:

DataTestWithPredRidge=augment(WFModelRidge, DataTest)
metrics(DataTestWithPredRidge, truth=Price, estimate=.pred)

## # A tibble: 3 × 3
##   .metric .estimator  .estimate
##   <chr>   <chr>           <dbl>
## 1 rmse    standard   330485.   
## 2 rsq     standard        0.237
## 3 mae     standard   186431.

Let us look at the average over/underestimation of the house prices (mae) and compare the Ridge model to the Lasso model. You can see that based on the testing data, the mae for the Ridge model is about $186,400 which is lower than the one for the Lasso model ($303,100). Both models outperform by far the unregularized model from Section 7.3, which produced a testing data mae of about $1,719,500, likely due to overfitting.

This figure shows the value for each model parameter at the vertical axis and the Ridge Penalty at the horizontal axis. You can see how the parameters become smaller when the penalty increases.

FIGURE 7.2: Parameter Estimates for Different Ridge Penalty Estimates

Figure 7.2 shows the $\beta$ estimates of the related predictor variables for penalty multipliers ranging from $\lambda=19,089$ to $\lambda=190,885,312$.⁴⁶

As in the Lasso model, when $\lambda$ increases, the calibrated $\beta$ values have a tendancy to move towards zero. In contrast to the Lasso model (see Figure 7.1), individual predictor variables are not eliminated (individual $\beta$ values are not set to zero). Instead, you can see in Figure 7.2 that the paths for the $\beta$ values converge to a common value. Afterward, when further increasing $\lambda$, this common value decreases further, until all $\beta$ values reach zero — none of the $\beta s$ reaches zero individually like in the Lasso case. Consequently, none of the predictor variables are eliminated individually.

This phenomenon can be explained by the Ridge penalty term. When $\lambda$ is large, the penalty term has a high weight compared to the $MSE$. Consequently, the Optimizer focuses mainly on the penalty term. Since larger (in absolute terms) $\beta$ values impact the penalty over-proportionally, the Optimizer has an incentive to lower these values first. This leads to a situation where $\beta$ values converge to simlar values in Figure 7.2, for large $\lambda$ values.

7.4.3 Elastic-Net — Combining Lasso and Ridge

In the previous two sections, we introduced Lasso and Ridge. Lasso tends to reduce $\beta$ parameter values to zero. Consequently, it eliminates predictors from the model and makes overfitting less likely. It is the model of choice if you want to reduce the number of predictors to those with the strongest explanatory power. In contrast, Ridge lowers the $\beta$ parameter values of all predictor variables to avoid overfitting without eliminating predictor variables. If you like to keep all predictor variables in the model, Ridge is the model of choice to avoid overfitting.

Often, we are ambivalent about keeping all predictor variables or not. The goal is to maximize predictive performance and to minimize overfitting. Therefore, it is often difficult to determine which algorithm is superior. So, why not combine Lasso and Ridge?

This is exactly what the Elastic-Net approach does. It uses both — the penalty for Lasso and the one for Ridge — and adds their weighted average to the $MSE$ in the target function:

\[\begin{eqnarray} T^{arget}_{Elastic}=\underbrace{\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2}_{MSE} + \lambda\biggl(\overbrace{ \alpha P^{enalty}_{Lasso} + (1-\alpha)P^{enalty}_{Ridge}}^{P^{enalty}_{Elastic}}\biggr)\tag{7.9}\\ T^{arget}_{Elastic}=\underbrace{\frac{1}{20}\sum_{i=1}^{20} \left ( \widehat{Price}_i-Price_i\right)^2}_{MSE} + \lambda\biggl(\overbrace{ \alpha\underbrace{ \sum_{i=j}^{5} \rvert \beta_j \rvert}_{P^{enalty}_{Ridge}} + (1-\alpha)\underbrace{\sum_{i=j}^{5} \beta_j^2}_{P^{enalty}_{Lasso}}}^{P^{enalty}_{Elastic}}\biggr)\tag{7.10} \end{eqnarray} $$\mbox{with: } \widehat{Price}_i=\beta_1 Sqft_i+\beta_2 Sqft_i^2+\beta_3 Sqft_i^3+\beta_4 Sqft_i^4+\beta_5 Sqft_i^5+\beta_6$$\]

In Equation (7.9) you can see that the penalty for $P^{enalty}_{Elastic}$ is calculated as a weighted average from the penalties $P^{enalty}_{Lasso}$ and $P^{enalty}_{Ridge}$. The hyper-parameter $\alpha$ $(0\le\alpha\le 1)$ determines the share of the Lasso penalty. For example, if $\alpha=0.5$ then the share of the Lasso penalty is 50% and the share of the Ridge penalty $((1-\alpha)=0.5)$ is also 50%. If $\alpha=0.3$ then the share of the Lasso penalty is 30%, and the share of the Ridge penalty $((1-\alpha)=0.7)$ is 70%.

The hyper-parameter $\alpha$ corresponds to the argument mixture= in the linear_reg() command. You worked with it already in Section 7.4.1 when you set mixture=1 assigning a share of 100% to the Lasso penalty and a share of 0% $((1-\alpha)=0)$ to the Ridge penalty — basically running a pure Lasso model.

In Section 7.4.2 you ran a pure Ridge model by setting mixture=0 $(\alpha=0)$ and thus implying that the share of the Ridge penalty is 100% $((1-\alpha)=1)$.

The advantage of the Elastic-Net approach is that you can run any mixture of Lasso and Ridge regularization. The disadvantage is that you get with $\alpha$ (mixture) another hyper-parameter that either needs to be set in the linear_reg() command or that needs to be tuned together with the hyper-parameter $\lambda$ (penalty).

You will tune these hyper-parameters later in an interactive project in Section 7.5. Here, in this section, we implement Elastic-Net into tidymodels by arbitrarily setting the hyper-parameters to keep things simple.

The commands to set up a tidymodels workflow for the Elastic-Net model are almost the same as in the previous section, except that we arbitrarily set penalty=10000 and mixture=0.5.

set.seed(777)
ModelDesignElastNet=linear_reg(penalty=10000, mixture=0.5) |>
                    set_engine("glmnet") |> 
                    set_mode("regression")

WFModelElastNet=workflow() |> 
                add_model(ModelDesignElastNet) |> 
                add_recipe(RecipeHouses) |> 
                fit(DataTrain)

For the recipe that creates the various powers of the $Sqft$ and normalizes the resulting predictors, we again use the recipe from Section 7.3.

Since the workflow WFModelRidge is a fitted workflow model (it was calibrated with the training data), we can extract the model parameters with the tidy() command:

tidy(WFModelElastNet)

## # A tibble: 6 × 3
##   term        estimate penalty
##   <chr>          <dbl>   <dbl>
## 1 (Intercept)  509945.   10000
## 2 Sqft         151417.   10000
## 3 Sqft2         89828.   10000
## 4 Sqft3             0    10000
## 5 Sqft4             0    10000
## 6 Sqft5        -58923.   10000

As expected, the $\beta$ parameters (see estimate column) are much smaller than those from the unregularized benchmark model in Section 7.3. The $\beta_3$ and $\beta_4$ parameters were set to zero, eliminating the related predictor variables $Sqft^3$ and $Sqft^4$ from the prediction equation. This reflects the influence of the Lasso approach. The model-parameters $\beta_1$, $\beta_2$, and $\beta_5$ are reduced to a similar range when ignoring the sign. This reflects the influence of the Ridge approach.

To evaluate if the Elastic-Net with Lasso and Ridge equally presented (mixture=0.5) can outperform the individual models, we will assess the Elastic-Net model’s predictive performance based on the testing data and compare with the performance of the Lasso and Ridge models:

Like before, we calculate the model’s metrics for the testing data using the augment() and metrics() commands:

DataTestWithPredElastNet=augment(WFModelElastNet, DataTest)
metrics(DataTestWithPredElastNet, truth = Price, estimate = .pred)

## # A tibble: 3 × 3
##   .metric .estimator    .estimate
##   <chr>   <chr>             <dbl>
## 1 rmse    standard   600767.     
## 2 rsq     standard        0.00664
## 3 mae     standard   180716.

As you can see, based on the testing data, the Elastic-Net model over/underestimates the house prices on average by about $180,700 (mae). For comparison the mae for Lasso was $303,100 and the mae for Ridge was $186,400.

7.5 🧭Project: Predicting House Prices with Elastic-Net

Interactive Section

In this section, you will find content together with R code to execute, change, and rerun in RStudio.

The best way to read and to work with this section is to open it with RStudio. Then you can interactively work on R code exercises and R projects within a web browser. This way you can apply what you have learned so far and extend your knowledge. You can also choose to continue reading either in the book or online, but you will not benefit from the interactive learning experience.

To work with this section in RStudio in an interactive environment, follow these steps:

Ensure that both the learnR and the shiny package are installed. If not, install them from RStudio’s main menu (Tools -> Install Packages $\dots$).
Download the Rmd file for the interactive session and save it in your project folder. You will find the link for the download below.
Open the downloaded file in RStudio and click the Run Document button, located in the editing window’s top-middle area.

For detailed help for running the exercises including videos for Windows and Mac users we refer to: https://blog.lange-analytics.com/2024/01/interactsessions.html

Do not skip this interactive section because besides providing applications of already covered concepts, it will also extend what you have learned so far.

Below is the link to download the interactive section:

https://ai.lange-analytics.com/exc/?file=07-RidgeLassoExerc100.Rmd

In the previous section, we worked with a univariate Polynomial Model to demonstrate regularization and to evaluate overfitting scenarios. In the real world, data scientists usually use multivariate models to include more than one variable in their analysis. Therefore, this interactive exercise will teach you how to use a multivariate approach.

7.5.1 The Data

You will again predict house prices using the King County House Sale dataset (Kaggle (2015)). In the code block below, the dataset is loaded. Instead of selecting only one variable to predict the price of a house $(Price_i)$, we use the select() command to choose the number of bedrooms $(Bedr_i)$ and the year a house was built $(Year)$, together with the house’s square footage $(Sqft)$. We filter out houses with a price of more than $1,300,000 and square footage 4,500 sqft or more, which are outliers:

library(tidymodels); library(rio); library(janitor); library(glmnet)

DataHousing=
  import("https://ai.lange-analytics.com/data/HousingData.csv")|>
  clean_names("upper_camel") |> 
  select(Price, Sqft=SqftLiving, Bedr=Bedrooms, Year=YrBuilt) |> 
  filter(Price<=1300000, Sqft<4500) 

set.seed(777)
Split005=initial_split(DataHousing, prop=0.005, strata=Price, breaks=5)
DataTrain=training(Split005)
DataTest=testing(Split005)

Note when creating the training dataset, we use the argument prop=0.005, leading to a training dataset with about 100 observations. This size is small, but in general realistic for a training dataset. However, given a total of 20,699 available observations, we could have used the more common prop=0.7. In that case, we would end up with 14,489 observations for the training dataset. With such a large training dataset, the unregularized model we use later in this section would likely not lead to overfitting, and thus regularization would no longer be needed — which would defeat the purpose of this interactive exercise.

7.5.2 Unregularized Benchmark Model

As a benchmark, we begin with an unregularized model. In this case, a multivariate Degree-2 Polynomial Model. It includes the variables $Sqft$, $Bedr$, and $Year$. They are used in their original form, squared, and as various interaction terms:

\[\begin{eqnarray} Price&=& \beta_1 Sqft+\beta_2 Bedr+\beta_3 Year\nonumber \\ &+&\beta_4 Sqft^2+\beta_5 Bedr^2+\beta_6 Year^2\nonumber \\ &+&\beta_7 Sqft Bedr+\beta_8 Sqft Year+\beta_9 Bedr Year\nonumber \\ &+&\beta_{10} Sqft^2 Bedr+\beta_{11} Sqft Bedr^2+\beta_{12} Sqft^2 Bedr^2\nonumber \\ &+&\beta_{13} Sqft^2 Year+\beta_{14} Sqft Year^2+\beta_{15} Sqft^2 Year^2\nonumber \\ &+&\beta_{16} Bedr^2 Year+\beta_{17} Bedr Year^2+\beta_{18} Bedr^2 Year^2 \tag{7.11} \end{eqnarray}\]

As before, we use a recipe to generate and normalize the additional predictors:

RecipeHousesMultivar=recipe(Price~., data=DataTrain) |> 
                     step_mutate(Sqft2=Sqft^2,Bedr2=Bedr^2,Year2=Year^2,
                       SqftxBedr=Sqft*Bedr,SqftxYear=Sqft*Year, 
                       BedrxYear=Bedr*Year,Sqft2xBedr=Sqft2*Bedr, 
                       SqftxBedr2=Sqft*Bedr2,Sqft2xBedr2=Sqft2*Bedr2,
                       Sqft2xYear=Sqft2*Year,SqftxYear2=Sqft*Year2,
                       Sqft2xYear2=Sqft2*Year2,Bedr2xYear=Bedr2*Year,
                       BedrxYear2=Bedr*Year2,Bedr2xYear2=Bedr2*Year2) |> 
                     step_normalize(all_predictors())

Afterward, we create the model design (ModelDesignBenchmark), the workflow model (WFModelBenchmark), and we calibrate WFModelBenchmark to the training data:

ModelDesignBenchmark=linear_reg() |> 
                     set_engine("lm") |> 
                     set_mode("regression")

WFModelBenchmark=workflow() |> 
                 add_model(ModelDesignBenchmark) |> 
                 add_recipe(RecipeHousesMultivar) |> 
                 fit(DataTrain)

When you execute the tidy() command in the code block below, you will see that all predictor variable parameters ($\beta s$), except the one for the intercept, are in the multi-million range, and some even in the multi-billion range (see column estimate).

tidy(WFModelBenchmark)

When you execute the code block below, the model’s training and testing data predictions are compared to the true values ($Price_i$), and the related metrics are calculated:

DataTrainWithPredBenchmark=augment(WFModelBenchmark,DataTrain)
print("Metrics for Training Data:")
metrics(DataTrainWithPredBenchmark, truth=Price, estimate=.pred)

DataTestWithPredBenchmark=augment(WFModelBenchmark,DataTest)
print("Metrics for Testing Data:")
metrics(DataTestWithPredBenchmark, truth=Price, estimate=.pred)

After executing the code above, you will see that the mae for the training data is about $111,700 and the mae for the testing data is about $142,800. Such a low training mae, when compared to the testing mae, strongly indicates overfitting.

In the following section, you will use an Elastic-Net model to improve the predictive quality.

7.5.3 Regularized Elastic-Net Polynomial Model

When you substitute the $\dots$ after mixture= and penalty= in the code further below with values of your choice and execute, the model’s $\beta$ values will be calibrated and printed. Also, the metrics for the testing data are calculated and printed.

Recall, if mixture=0, you will run a Ridge model; if mixture=1, you will run a Lasso model. For anything in between, you will run a mixed model.

Finding an appropriate value for the argument penalty= is more tricky. For the Ridge penalty, values between $15{,}000$ and $154{,}000{,}000$ will work. For the Lasso penalty, values between $15$ and $154{,}000$ will work.⁴⁷

Try a few settings and see how the results are changing. For example, use mixture=1 (Lasso) and penalty=100000. Afterward, increase and decrease penalty= and leave mixture=1 constant. How are the results changing? Then try mixture=0 and leave penalty=100000 to see the different results for a Ridge model. You can also set mixture=0.5 or mixture=0.7 to try out different Elastic-Net models. Playing with different hyper-parameter values and observing the resulting $\beta s$ and metrics will help you to better understand the $Ridge$, $Lasso$, and Elastic-Net approaches.

ModelDesignRegularized=linear_reg(penalty=... , mixture=...) |> 
                       set_engine("glmnet") |> 
                       set_mode("regression")

WFModelRegularized=workflow() |> 
                   add_model(ModelDesignRegularized) |> 
                   add_recipe(RecipeHousesMultivar) |> 
                   fit(DataTrain)

print("Beta Values (see Column Estimate:)")
tidy(WFModelRegularized)

DataTestWithPredRgularized=augment(WFModelRegularized, DataTest)
print("Metrics for Testing Data:")
metrics(DataTestWithPredRgularized, truth=Price, estimate=.pred)

Warning: In the code block above, you changed hyper-parameters and evaluated the effects based on the testing data to better understand regularization. Never use the testing data to find the best hyper-parameters. Otherwise, you end up with a model that is highly specialized to the testing data, but when used on new data in the production phase, it would likely perform very poorly.

Instead, use validation data derived from the training dataset to evaluate the performance of hyper-parameters. This is the underlying principle of tuning, and in the next section, you will tune the workflow model to improve performance.

7.5.4 Tuning the Elastic-Net Polynomial Model

In this section, you will use the 10-Step Template to tune the hyper-parameters penalty and mixture.

Step 1 - Generate Training and Testing Data:

The training and testing data have already been generated (see Section 7.5.3 above).

Step 2 - Create a Recipe:

The recipe to create the predictors for the multivariate Degree-2 Polynomial model (RecipeHousesMultivar) was created in connection with the unregularized model and can be reused here. It establishes the predictors with various powers and the interaction terms.

Step 3 - Create a Model Design:

The model design from the previous section needs slight modifications. We have to add the tune() placeholder for the hyper-parameters:

ModelDesignElastNetTune =linear_reg(penalty=tune(), mixture=tune()) |> 
                         set_engine("glmnet") |> 
                         set_mode("regression")

Step 4 - Add the Recipe and the Model Design to a Workflow:

As before, we add the recipe and the (modified) model design to the workflow:

TuneWFModel=workflow() |>
            add_recipe(RecipeHousesMultivar) |>     
            add_model(ModelDesignElastNetTune)

Step 5 - Create a Hyper-Parameter Grid:

Previously, when only one hyper-parameter was tuned, you added the values to be tried out in a data frame column named with the same name as the hyper-parameter. When you have more than one hyper-parameter, it gets a little more complicated because you want to try out different combinations of hyper-parameters.

If you decide to try out all combinations of the values from two or more hyper-parameters, you can use the crossing() command. The arguments of the crossing() command are the hyper-parameters you would like to try out. The assigned values to these arguments are the values you want to try out for each of the hyper-parameters. They are assigned to each argument as R vector objects. The crossing() command then generates all possible combinations for the values of the hyper-parameters.

For example, in this project you will later try 66 penalty values from 15,000 to 80,000 with an increment of 1,000 and three mixture values (0, 0.5, and 1). You can use the crossing() command to generate all possible combinations:

ParGridHouses=crossing(mixture=c(0, 0.5, 1),
                       penalty=seq(15000, 80000, 1000))
print(ParGridHouses)

## # A tibble: 198 × 2
##    mixture penalty
##      <dbl>   <dbl>
##  1       0   15000
##  2       0   16000
##  3       0   17000
##  4       0   18000
##  5       0   19000
##  6       0   20000
##  7       0   21000
##  8       0   22000
##  9       0   23000
## 10       0   24000
## # ℹ 188 more rows

The R vector object assigned to mixture is created with the c() command by listing the related values. The R vector object assigned to penalty is created with the seq() command. The first value in seq() determines where the sequence starts ($15{,}000$) and the second where it ends ($80{,}000$). The third argument determines the increment (the step size; $1{,}000$).

You can see in the printout above that crossing() generates all possible combinations of the penalty values and the mixture values. Each of these 198 combinations will be tried out when you tune the workflow later on.

The crossing() command is very helpful when you combine multiple hyper-parameters with several values each. Imagine you want to create a grid with all combinations of five hyper-parameters with ten values each, crossing() would create $100{,}000$ parameter combinations ($10\cdot 10\cdot 10\cdot 10\cdot 10=10^5$) for you. Be advised, tuning $100{,}000$ hyper-parameter combinations would use a lot of computing time.

In general, computing time increases exponentially with the number of hyper-parameters, while computing time increases proportionally when the values for each hyper-parameter increase. These facts must be considered when deciding how many hyper-parameters you want to tune and how many values you like to consider for each hyper-parameter.

Step 6: - Create Resamples for Cross-Validation:

To create resamples for Cross-Validation, we use the vfold() command. However, although usually recommended, creating ten Cross-Validation folds creates a problem. Given that we have only 101 observations, the validation set of each fold would consist only of about ten observations. Therefore, we will create sets of three folds (v=3), which leaves about $30$ observation for each validation dataset. To offset the small number of folds, we repeat the process of creating folds ten times (see the argument repeats=10) creating a total of $30$ folds:

FoldsForTuning=vfold_cv(DataTrain, v=3, strata=Price, repeats=10)

Step 7 – Step 10:

Now, you are ready to finalize the tuning process in the interactive project. The training and testing data (from Step 1), the workflow including the recipe and the model design (from Steps 2 – 4), as well as the resamples for validation (from Step 6) have already been loaded in the background.

In the code block below, you only have to define the data frame with the parameter grid as explained in Step 5 to try out various combinations for the hyper-parameters penalty= and mixture=.

For the first time you run the code, it is recommended to use the values as determined in Step 5 (you can later change them). Note, executing the code can take a while because penalty=seq(15000,80000,1000) will create 66 penalty values, and mixture=seq(0,1,0.5) will create 3 mixture values. The crossing() command will then create 198 combinations ($66\cdot3$) which will be all tried out and evaluated in the tuning process in Step 7.

When R outputs the results, look carefully at the $\beta$ parameters, the plot generated by autoplot() after Step 7, and the metrics for the testing data. Then change the c() and seq() commands that generate the hyper-parameter values for penalty= and mixture=. Afterward, execute the code again and see how the results change.

Please recall, that for this setup Ridge penalty values between $15{,}000$ and $154{,}000{,}000$ will work and Lasso penalty values between $15$ and $154{,}000$ will work.⁴⁸

# Step 5 - Create a Hyper-Parameter Grid: 
#
# ParGridHouses contains the hyper-parameter combinations to be 
# tried out.
# Change the values for the seq() commands to try out different values.
ParGridHouses=crossing(penalty=seq(15000, 80000, 1000), 
                       mixture=seq(0, 1, 0.5))
print("Hyper-Parameter Combinations to be Tried Out:")
ParGridHouses

# Step 7: Tune the Workflow and Train All Models
set.seed(777)
TuneResultsElastNet=tune_grid(WFModelElastNetTune,  
                              resamples=FoldsForTuning, 
                              grid=ParGridHouses, 
                              metrics = metric_set(rmse))

# The diagram shows how the  *root mean squared error* (`rmse`) 
# changes with increasing regularization:
autoplot(TuneResultsElastNet)

# Step 8 - Extract the Best Hyper-Parameter(s):
BestHyperPara=select_best(TuneResultsElastNet, metric="rmse")
print("Best values for the Hyper-Parameters")
BestHyperPara

# Step 9: Finalize and Train the Best Workflow 
BestWFModel=finalize_workflow(WFModelElastNetTune, BestHyperPara) |> 
            fit(DataTrain)

print("Optimal Beta Parameters (see estimate column)")
tidy(BestWFModel)

# Step 10: Assess Prediction Quality Based on the Testing Data:
DataTestBestWFModelWithPred=augment(BestWFModel, DataTest)
print("Metrics for the Testing Data")
metrics(DataTestBestWFModelWithPred, truth = Price, estimate = .pred)

7.6 When and When Not to Use Ridge and Lasso Models

Regularization can be used for a wide range of machine learning models to avoid overfitting. Every machine learning model with an error function that depends on model-parameters ($\beta s$) can use regularization.

Regularization will decrease the values of model-parameters possibly all the way to zero. Smaller model-parameters will weaken (or eliminate) the influence of the related predictor variables and thus help to avoid overfitting.
Considering computing resources you should also consider that regularization adds one or more hyper-parameters that need to be tuned (one extra hyper-parameter for Lasso or Ridge and two for Elastic-Net).
If you have a large training dataset compared to the number of model-parameters to be optimized, regularization might not be necessary. However, you can still use regularization in an attempt to improve predictive quality further.
Using regularization for one or more hyper-parameters is often a judgment call that should be guided by the required computing time, and the expectation that smaller model-parameters improve predictive quality.
If you use regularization, the question arises which regularization approach to use.
- Use Lasso if your goal is to eliminate predictor variables from the model.
- Use Ridge to lower the values of all $\beta$ parameter values without eliminating predictor variables.
- Use Elastic-Net to combine Lasso and Ridge when you have no preference about eliminating predictor variables or not, and your main goal is to improve predictive performance and to lower overfitting. Consider that Elastic-Net adds an additional hyper-parameter (mixture) that might need tuning.
Some models, like k-Nearest Neighbors, do not optimize an error function and are thus unsuitable for regularization.

7.7 Digital Resources

Below you will find a few digital resources related to this chapter such as:

Videos
Short articles
Tutorials
R scripts

These resources are recommended if you would like to review the chapter from a different angle or to go beyond what was covered in the chapter.

Here we show only a few of the digital resourses. At the end of the list you will find a link to additonal digital resources for this chapter that are maintained on the Internet.

You can find a complete list of digital resources for all book chapters on the companion website: https://ai.lange-analytics.com/digitalresources.html

Regularization in R Tutorial: Ridge, Lasso and Elastic Net

A free DataCamp tutorial about Ridge, Lasso, and ElasticNet regularization that also discusses the trade-off between bias and variance in machine learning.

Link: https://ai.lange-analytics.com/dr?a=371

Lasso Regression Using tidymodels with data for “The Office”

A video and article by Julia Silge published in her blog TidyTuesday. The post describes how to use tidymodels to analyze data from the TV series “The Office”. A Lasso approach is tuned to regularize the model-parameters for a linear regression.

Link: https://ai.lange-analytics.com/dr?a=372

Regularization: What? Why? and How? (Part 1)

The first part of two articles by Siddant Rai in MLearning.ai. The author describes requirements for regularization in Part 1 and regularisation techniques in Part 2.

Link: https://ai.lange-analytics.com/dr?a=370

Linear Regression Does Not Need Normalization but Ridge/Lasso Regression Does

This blog post by Carsten Lange discusses normalization. It shows that although normalization is not needed for linear OLS regression, it is needed when a penalty term is used, including Lasso, Ridge, and Elastic-Net regressions. The blog post article is interactive and provides an R script with an intuitive example.

Link: https://ai.lange-analytics.com/dr?a=374

More Digital Resources

Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

Link: https://ai.lange-analytics.com/dr/ridgelasso.html

References

Aden-Buie, Garrick, Barret Schloerke, and JJ Allaire. 2022. Learnr: Interactive Tutorials for r. https://CRAN.R-project.org/package=learnr.

Chan, Chung-Hong, Geoffrey C. H. Chan, Thomas J. Leeper, and Jason Becker. 2021. Rio: A Swiss-Army Knife for Data File i/o.

Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.

Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.

Kaggle. 2015. “House Sales in King County, USA.” Online. https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Tay, J. Kenneth, Balasubramanian Narasimhan, and Trevor Hastie. 2023. “Elastic Net Regularization Paths for All Generalized Linear Models.” Journal of Statistical Software 106 (1): 1–31. https://doi.org/10.18637/jss.v106.i01.

Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

Other than comparing the size of the parameters between the two models, the individual $\beta$ parameters are not interpretable. This is because a univariate Degree-5 Polynomial model is already too complex to be interpreted.↩︎
The linear_reg() command from the glmnet package does not only calibrates the $\beta$ parameters based on the $\lambda$ chosen (penalty=500), it also, by default, calibrates for $100$ other $\lambda$ values. The results for Figure 7.1 were extracted with the command extract_fit_engine(WFModelLasso). We do not use these results for tuning because we will use the more advanced tidymodels tuning approach later in Section 7.5.↩︎
More about the mixture= argument in Section 7.4.3.↩︎
The linear_reg() command from the glmnet package does not only calibrate the $\beta$ parameters based on the $\lambda$ chosen (penalty=1000000), it also calibrates $100$ other $\lambda$ values. The results for Figure 7.2 were extracted with the command extract_fit_engine(WFModelLasso). We do not use these results for tuning because we will use the more advanced tidymodels tuning approach later in Section 7.5.↩︎
These recommendations are based on the 100 $\lambda$ values that glmnet analyzes by default, depending on the model and the training data. Values outside these ranges might produce surprising results.↩︎
These recommendations are based on the 100 $\lambda$ values that glmnet tries by default, depending on the model and the training data. Values outside these ranges might produce surprising results.↩︎