Chapter 2 Basics of Machine Learning
This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.
2.1 Learning Outcomes
This section outlines what you can expect to learn in this chapter. In addition, each learning outcome lists the corresponding section number to help you navigate the content, especially when you return to the chapter for review.
In this chapter, you will learn:
How Machine Learning compares to Artificial Intelligence and Deep Learning (see Section 2.2).
How machine learning algorithms can be categorized into regression, classification, and clustering algorithms (see Section 2.3).
How to use important terminology correctly in machine learning (see Section 2.4).
2.2 Machine Learning, Artificial Intelligence, and Deep Learning
FIGURE 2.1: Artificial Intelligence, Machine Learning, and Deep Learning
The terms Artificial Intelligence, Machine Learning, and Deep Learning are often used synonymously. However, there is a clear distinction and hierarchy related to these terms (see Figure 2.1 for an overview):
- Artificial Intelligence (AI)
-
is defined as computer algorithms that perform tasks previously believed to be performable only by humans. This includes rule-based algorithms (e.g., detailed instructions for a robot to assemble a car) as well as data-based algorithms (e.g., using historical data to calibrate a prediction model). Following this definition, AI applications include tasks such as advanced home automation, speech recognition (Natural Language Processing), Optical Character Recognition (OCR), self-driving cars, as well as systems that can produce art (see, for example, Dall-E).3
- Machine Learning
-
is a sub-field of AI. It is data-based rather than rule-based.
Data-based systems derive their rules from training data, which are used to calibrate the machine learning model. That is, the algorithm infers the underlying rules from the training data on its own.
In contrast, rule-based systems rely on experts to define rules that determine what to do in which situation. Rule-based systems are not considered machine learning.
Well-known machine learning algorithms include Ordinary Least Squares (OLS) Regression, Logistic Regression, k-Nearest Neighbors, Neural Networks, Random Forest, and various clustering algorithms.
- Deep Learning
-
is a sub-field of machine learning. It uses Neural Networks as the core methodology.
-
A Neural Network processes training data through layers of weighted non-linear functions (neurons). A learning algorithm successively updates the parameters of the Neural Network to improve the prediction quality (more about Neural Networks in Chapter 9).
-
If a Neural Network has many neurons (often millions or more) and if these neurons are organized into many layers, often with varying functionalities, the underlying algorithm is called a Deep Learning algorithm. Applications that usually require Deep Learning are Natural Language Processing (NLP), advanced image recognition, and programming code completion.
Figure 2.1 shows how the above terms are related.
2.3 Machine Learning Tasks
Machine learning algorithms can be used for various purposes and can be categorized by the tasks they perform:
- Regression:
-
A regression seeks to predict a continuous variable. The goal is to predict how large (or small) a specific outcome variable is, based on the known values of other variables. Regression analysis can be linear (e.g., OLS Regression) or non-linear; examples of the latter include Random Forest, Polynomial Regression, and Neural Networks.
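To make this concrete, here is a minimal regression sketch in base R. It uses the built-in mtcars dataset (an illustrative choice, not one of this book's datasets) to predict the continuous outcome mpg (miles per gallon) from the predictor wt (weight):

```r
# OLS regression: predict the continuous outcome mpg from the predictor wt
# (both from R's built-in mtcars dataset)
model <- lm(mpg ~ wt, data = mtcars)

# Predict mpg for a hypothetical car weighing 3,000 lbs (wt is in 1,000 lbs)
predict(model, newdata = data.frame(wt = 3))
```

The prediction is a number on a continuous scale, which is exactly what distinguishes regression from classification.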
- Classification:
-
A classification algorithm seeks to predict a category. In many cases (including all cases covered in this book), the categories are limited to only two values, such as Yes/No, Red Wine/White Wine, or True/False. Most machine learning algorithms require the related categorical variables to be stored as an R factor variable (more about factor variables in Section 3.5.1).
-
Categories to be predicted can also have more than two values. These categories can be either ordered or not ordered. Examples of ordered categories include ratings such as Good/Fair/Poor or Strongly Agree/Agree/Disagree/Strongly Disagree. Examples of unordered categories include colors such as Red/Blue/Green or marital status such as Single/Married/Widowed/Separated.
-
Machine learning algorithms for classification include Logistic Regression, k-Nearest Neighbors, Random Forest, and Neural Networks.
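As an illustrative sketch (again using the built-in mtcars data rather than this book's datasets), a two-category classification with Logistic Regression in base R could look like this; note that the categorical outcome is stored as an R factor:

```r
# Classification: predict the two-level category am (transmission type)
# from hp (horsepower) and wt (weight), using Logistic Regression.
# The categorical outcome is stored as an R factor (see Section 3.5.1).
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Predicted probability of "Manual" for a hypothetical car
predict(model, newdata = data.frame(hp = 110, wt = 2.5), type = "response")
```

The model returns a probability between 0 and 1; the predicted category is obtained by comparing that probability to a cutoff (commonly 0.5).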
- Cluster Analysis:
-
An algorithm performing Cluster Analysis seeks to sort observations into groups (clusters) based on the variable values of each observation. The characteristics of the clusters, and often even the number of clusters generated, are not determined before the analysis. The goal of the Cluster Analysis algorithm is to form clusters that are as homogeneous as possible within each cluster and as diverse as possible between different clusters.
-
Machine learning algorithms for Cluster Analysis include Centroid-based Clustering algorithms such as k-Means, Hierarchical Clustering, and Distribution Clustering. However, Cluster Analysis is beyond the scope of this book. To learn more about Cluster Analysis, please refer to Wong (2023) for a brief overview and to Hennig, Murtagh, and Rocci (2016) for a comprehensive book.
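Although Cluster Analysis is beyond the scope of this book, a minimal k-Means sketch in base R (on the numeric columns of the built-in iris dataset, an illustrative choice) shows the idea of sorting observations into clusters:

```r
# k-Means: sort the iris observations into 3 clusters based only on the
# four numeric measurement columns (the species labels are not used)
set.seed(123)             # k-Means starts from random cluster centers
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster)   # how many observations fell into each cluster
```

Note that no outcome variable is involved: the algorithm groups the observations purely by similarity of their variable values.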
Looking at the above examples for regression and classification, you might have recognized that some algorithms can be used for both regression and classification (e.g., Random Forest and Neural Networks). In contrast, other algorithms can be used exclusively for regression (e.g., OLS Regression) or for classification (e.g., Logistic Regression).
2.4 Machine Learning Terminology
In this book, we limit machine learning terminology to the essential terms. Terminology is important because when we communicate, we need to agree on the meaning of specific terms. This prevents misunderstandings and avoids defining terms over and over again.
In what follows, we define eight important machine learning terms which will be used frequently throughout the book:
- Prediction:
-
Prediction means using the values of one or more known variables to estimate the value of an unknown variable (the outcome). Predictions can be forecasts of an event in the future. For example, we might predict tomorrow’s weather based on today’s weather and on today’s barometric change. However, predictions are often not forecasts: predicting the price of a house (today) based on its square footage (today) is a prediction but not a forecast.
- Predictor vs. Outcome Variables:
-
A predictor variable is a variable that is used to predict an outcome for an observation. The values of predictor variables are known, and they are used to predict the value of the outcome variable for the related observation. If the value of an outcome is based on a prediction, we add a hat (\(\widehat{\quad }\)) on top of the outcome variable. In contrast, if we know the value of the outcome variable from the data, we omit the hat.
-
Here is an example: Assume we want to predict the outcome variable \(Price_i\) for a specific single-family home (this house would be our observation \(i\)), and the prediction is based on the house’s square footage (predictor variable \(Sqft_i\)). If the square footage of this house is \(Sqft_i=2{,}000\), a machine learning model might predict a price of \(\$750{,}000\) (outcome variable \(\widehat{Price_i}=750{,}000\)). Suppose we know from the data that the true value of the outcome variable is \(\$800{,}000\) (\(Price_i=800{,}000\)); then we know that we underestimated the house price by \(\$50{,}000\) (more about prediction errors in Section 5.3).
- Model:
-
A model is what we use to predict the value of an outcome variable based on values of predictor variables given certain assumptions.
-
Suppose we try to predict home prices based on the square footage of houses using OLS regression, which assumes the relationship between the outcome price (\(Price_i\)) and the predictor square footage (\(Sqft_i\)) is linear. Then the related model could be expressed as:
\[\begin{equation} \widehat{Price_i}=\beta_1 Sqft_i + \beta_2 \tag{2.1} \end{equation}\] -
Equation (2.1) is our model. It models the relationship between the predictor \(Sqft_i\) and the outcome \(Price_i\) and leads to a prediction of the house price \((\widehat{Price_i})\). We can use Equation (2.1) to calculate a predicted price for any value of the predictor variable \(Sqft_i\), if we know the values for \(\beta_1\) and \(\beta_2\).
- Fitted Model vs. Unfitted Model:
-
Let us return to our model from Equation (2.1). Can we use the model to predict the price of a house if we know the house’s square footage? The answer is “no” because we do not know the values for the \(\beta s\) (called the parameters of the model).
-
However, we can use a machine learning algorithm to determine the values for the parameters (the \(\beta s\)). Machine learning algorithms often choose parameter values that minimize the (squared) prediction error. Using the parameter values provided by the machine learning algorithm then allows for predicting the outcome variable based on the value(s) of the predictor variable(s).
-
For example, suppose an OLS algorithm, based on the provided data, determines that \(\beta_1=300\) and \(\beta_2=150{,}000\); then our model would look like this:
\[\begin{equation} \widehat{Price_i}=300 \cdot Sqft_i + 150000 \tag{2.2} \end{equation}\] -
A model whose parameters (the \(\beta s\)) have been determined by a machine learning algorithm based on training data, such as the one in Equation (2.2), is called a fitted model. A model whose parameters have yet to be determined and that, therefore, cannot be used for prediction (yet) is called an unfitted model (e.g., the model in Equation (2.1)).
-
For instance, the fitted model displayed in Equation (2.2) can be used to predict the price for a house with a square footage of 2,000 sqft (predictor: \(Sqft_i=2{,}000\), predicted outcome: \(\widehat{Price_i}=750{,}000\)).
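The calculation with the fitted model from Equation (2.2) can be verified directly in R:

```r
# The fitted model from Equation (2.2), written as an R function
predict_price <- function(sqft) 300 * sqft + 150000

predict_price(2000)           # predicted price: 750000
800000 - predict_price(2000)  # the model underestimates the true price by 50000
```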
- Parameters:
-
Parameters are the \(\beta s\) of a model. Machine learning algorithms generate the best parameter values according to a predefined goal (e.g., minimize the average squared error for all predictions for the provided training dataset).
-
Consequently, machine learning can be (over)simplified into the following three steps:
-
- Determine the model (most models are more complex than the one in Equation (2.1)).
-
- Use a machine learning algorithm to determine the optimal \(\beta s\) to create a fitted model.
-
- Use the fitted model to predict values for the outcome variable based on predictor variable values.
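These three steps can be sketched in R with a small hypothetical training dataset (the prices below are made up so that they follow Equation (2.2) exactly):

```r
# Hypothetical training data (constructed so that Price = 300*Sqft + 150000)
train_data <- data.frame(
  Sqft  = c(1500, 2000, 2500, 3000),
  Price = c(600000, 750000, 900000, 1050000)
)

# Step 1: Determine the model: Price = beta1 * Sqft + beta2
# Step 2: Use a machine learning algorithm (here OLS) to determine the betas
fitted_model <- lm(Price ~ Sqft, data = train_data)
coef(fitted_model)   # beta2 (intercept) = 150000, beta1 (slope) = 300

# Step 3: Use the fitted model to predict outcomes for new predictor values
predict(fitted_model, newdata = data.frame(Sqft = 2200))   # 810000
```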
- Training Data:
-
Determining the parameters (the \(\beta s\)) of a machine learning model is a critical step. This step is called training a model, and it requires training data. A training dataset consists of the known values of the predictor variables and the corresponding known values of the outcome variable. The training process then calibrates the parameters to minimize the difference between the model’s predictions and the actual values of the outcome variable from the training data. In other words, the training process strives to minimize the prediction error (more about training data in Section 6.3).
- Testing Data:
-
When calibrating the parameters, most, but not all, of the observations from the complete dataset are used for training. The observations that are not used for training constitute the testing data. About 10%–40% of the total observations are commonly used as testing data, and they are usually chosen randomly from the complete dataset. Testing data are never used to optimize model performance in any way. Instead, they form a holdout dataset used to assess the predictive quality of a model.
-
Note that using training data to assess predictive quality is not an option! If we did, we would measure how well the model can approximate the data it was trained with, rather than assessing its predictive quality on new data, that is, on testing data the model has never seen before (more about testing data in Section 6.3).
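A random train/test split can be sketched in base R as follows (the 30% share and the built-in mtcars data are illustrative choices):

```r
set.seed(789)                                   # make the random split reproducible
n         <- nrow(mtcars)
test_rows <- sample(n, size = round(0.3 * n))   # randomly pick ~30% of the rows

test_data  <- mtcars[test_rows, ]               # holdout data: never used for training
train_data <- mtcars[-test_rows, ]

# Train on the training data only ...
model <- lm(mpg ~ wt, data = train_data)

# ... and assess predictive quality on the unseen testing data
rmse <- sqrt(mean((test_data$mpg - predict(model, test_data))^2))
```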
2.5 Digital Resources
Below you will find a few digital resources related to this chapter such as:
- Videos
- Short articles
- Tutorials
- R scripts
These resources are recommended if you would like to review the chapter from a different angle or to go beyond what was covered in the chapter.
Here we show only a few of the digital resources. At the end of the list you will find a link to additional digital resources for this chapter that are maintained on the Internet.
You can find a complete list of digital resources for all book chapters on the companion website: https://ai.lange-analytics.com/digitalresources.html
What is AI?
A short (5 min) video by Cassie Kozyrkov, former Chief Decision Scientist at Google. Although AI is not clearly defined in the literature, Cassie Kozyrkov attempts (successfully) to provide a concise definition.

What is Machine Learning?
A short (3 min) video by Cassie Kozyrkov, former Chief Decision Scientist at Google about what is considered machine learning.

Why Use Machine Learning?
A short (3 min) video by Cassie Kozyrkov, former Chief Decision Scientist at Google. In this video, Cassie Kozyrkov explains why and how machine learning can be useful.

The Basic Idea Behind Machine Learning Algorithms
A short (5 min) video by Cassie Kozyrkov, former Chief Decision Scientist at Google. In the video, Cassie Kozyrkov gives a very basic and intuitive idea of how some machine learning algorithms work.

More Digital Resources
Only a subset of digital resources is listed in this section. The link below points to additional, concurrently updated resources for this chapter.

References
Dall-E is an AI system that can create realistic images and art from a description in natural language (see https://openai.com/dall-e-2/).↩︎