Chapter 1 Introduction

This is an Open Access web version of the book Practical Machine Learning with R published by Chapman and Hall/CRC. The content from this website or any part of it may not be copied or reproduced. All copyrights are reserved.

With rapid advancements in recent years, machine learning and artificial intelligence (AI) have become increasingly relevant and have already started transforming many businesses. Knowledge of machine learning and AI is critical not just for STEM (science, technology, engineering, and mathematics) majors but also for students in many other fields, including business, economics, and other social sciences.

To better prepare students for a future with AI, I started teaching machine learning in 2019 to help economics students understand and apply fundamental machine learning principles. My goal is to combine theoretical concepts with hands-on projects to equip students with the skills to solve real-world problems. When looking for a textbook, I found some excellent books covering the foundations and mathematical theories behind machine learning. Unfortunately, unlike STEM students, many business and economics students do not have the strong mathematics and programming background required to make use of these books. On the other side of the spectrum, machine learning books that target business majors are often too focused on applications and fall short of explaining quantitative methods. I saw a need for a machine learning textbook that takes a hands-on approach, lets students work interactively step by step, and introduces the quantitative concepts in a less mathematical way. This need gave me the idea to write this book.

This book introduces machine learning algorithms and explains the underlying concepts without relying on advanced mathematics such as matrix algebra or calculus. Each chapter provides examples, case studies, and interactive tutorials. The examples and hands-on tutorials use the R language, which is widely used for statistical analysis and data science. R’s relatively simple syntax is easy for beginners to learn, which makes R a good choice for teaching machine learning. No prior programming skills are required to work with this book. A dedicated R chapter introduces the R skills needed for the course. In addition, each chapter offers one or more interactive R tutorials. Students can work with real-world data and use the interactive environment to learn and experiment with R code in a web browser.

1.1 How the Book is Organized

The order in which machine learning algorithms are introduced in this book does not follow the traditional order of most other machine learning textbooks. Most machine learning books cover theoretical concepts and terminology up front before applying machine learning algorithms to solve problems.

In my experience, a more effective way of teaching machine learning is to introduce the basics of an algorithm and apply the model immediately. Along the way, the necessary terminology and theory are introduced at the point where they become relevant to the specific workflow being explained.

A good example is the k-Nearest Neighbors algorithm, which is usually covered in the later part of most textbooks. Here, k-Nearest Neighbors is the first algorithm covered because it can be explained and applied without any prior knowledge of data science. The k-Nearest Neighbors chapter also explains the concept of the confusion matrix, which is used to interpret the k-Nearest Neighbors results and is needed for some of the subsequent chapters.
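
To give a rough impression of how little is needed to apply k-Nearest Neighbors and read a confusion matrix, here is a minimal sketch. It is not taken from the book: the `class` package, the built-in `iris` data, the 100/50 split, and `k = 5` are all assumptions chosen for illustration only.

```r
# Minimal sketch (illustrative assumptions: class package, iris data, k = 5)
library(class)  # recommended package that ships with standard R installations

set.seed(123)                              # make the random split reproducible
train_rows <- sample(nrow(iris), 100)      # 100 of the 150 rows for training

train_x <- iris[train_rows, 1:4]           # the four numeric measurements
test_x  <- iris[-train_rows, 1:4]
train_y <- iris$Species[train_rows]        # known species (training labels)
test_y  <- iris$Species[-train_rows]       # held back to check predictions

# Classify each test flower by a majority vote of its 5 nearest training flowers
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Confusion matrix: rows are the actual species, columns the predicted species
conf_mat <- table(Actual = test_y, Predicted = pred)
print(conf_mat)

# Accuracy: correctly classified flowers (the diagonal) over all test flowers
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Every off-diagonal entry of the printed table counts one kind of misclassification, which is why the confusion matrix is a natural first tool for interpreting classification results.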

The principle of introducing essential data science concepts and terminology on a “when-needed” basis is reflected in the chapter titles. Most chapters are titled with the machine learning algorithm covered in the chapter, together with a subtitle reflecting the data science concept(s) introduced in connection with that algorithm.

1.2 Using `tidymodels` for Data Processing and Model Workflows

The tidymodels package (Kuhn and Wickham (2020)), an extension of R, is used throughout the book.¹ This package unifies and limits the commands needed to work with the various machine learning algorithms, regardless of which model is used: only a few lines must be adjusted when a new machine learning model is introduced. The resulting code is intuitive and easy to read, which allows readers to focus on the logic of the algorithms rather than on programming syntax. Instead of reinventing the wheel and providing new machine learning libraries, tidymodels provides wrappers for the most common machine learning models in R. This way, the code is unified while still drawing on the most advanced and up-to-date libraries, such as TensorFlow and PyTorch.
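
The idea that only a few lines change between models can be sketched as follows. This is a hedged illustration rather than code from the book: the `iris` data, the 70/30 split, the helper name `fit_and_score`, and the choice of the `kknn` and `rpart` engines (both packages are assumed to be installed alongside tidymodels) are assumptions made for this example.

```r
# Minimal sketch (assumes the tidymodels, kknn, and rpart packages are installed)
library(tidymodels)

set.seed(123)
data_split <- initial_split(iris, prop = 0.7, strata = Species)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Recipe: preprocessing shared by all models (normalize numeric predictors)
rec <- recipe(Species ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Hypothetical helper: fit any model specification and report test accuracy
fit_and_score <- function(model_spec) {
  fitted_wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(model_spec) %>%
    fit(data = train_data)

  # augment() appends the predicted class (.pred_class) to the test data
  augment(fitted_wf, new_data = test_data) %>%
    accuracy(truth = Species, estimate = .pred_class)
}

# The model specification is the only part that differs between algorithms
model_knn <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

model_tree <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

acc_knn  <- fit_and_score(model_knn)
acc_tree <- fit_and_score(model_tree)
print(acc_knn)
print(acc_tree)
```

The data split, recipe, workflow, and evaluation code are identical for both models; swapping k-Nearest Neighbors for a decision tree only means replacing the three-line model specification.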

1.3 Interactive Sections and Digital Resources

Each chapter contains one or more interactive sections marked with the compass symbol (🧭) and also contains a section called Digital Resources.²

In the interactive sections, readers can use a provided web link to download a file, which can be executed in RStudio, the integrated development environment (IDE) for R. When the file is executed in RStudio, all text, formulas, and graphs from the code blocks are shown in a web browser in a format similar to the book section. In addition, readers can work interactively with the R code by solving exercises or experimenting with the provided code (modifying function parameters, etc.). This hands-on approach allows students to apply a machine learning algorithm to real-world scenarios.

In addition to the interactive sections, each chapter also contains a Digital Resources section. These sections provide links to short articles, videos, data, and sample R code scripts. The most relevant digital resources are directly available through the Digital Resources section at the end of each chapter. Additional digital resources are provided on a website linked from each Digital Resources section; this website is expanded and updated on an ongoing basis.

1.4 Companion Website

The book comes with a companion website that provides additional materials for students and instructors.

You can access the companion website at: https://ai.lange-analytics.com

Student resources include:

An extended version of the Digital Resources for each chapter (see Section 1.3 for details about digital resources).
A moderated forum where readers can post questions or discuss content related to the topics in the book.
A blog that covers topics related to the book chapters and, in many cases, allows a deeper dive into the concepts covered.
Videos that complement the chapters of the book.
A recommendation system where readers can post ideas for the book or report errors.

Instructor resources include:

Presentation slides for each chapter. An instructor can use these slides out of the box or adjust them to the instructor’s needs.
Not every instructor will use all chapters and sections of the book, but deciding which topics to cover can be very time-consuming. Therefore, the companion website provides several syllabus outlines that depend on the length of the course, the students’ backgrounds, and whether the course is taught in a semester or quarter system. The recommended outlines ensure that all concepts required for a particular chapter have been covered in the previous chapters. Instructors can use a web-based application to customize the outlines. This web app then generates the related outline, ready to be used in a syllabus.

1.5 Acknowledgments

I wish to thank all who helped make this book possible. Special thanks go to my research assistant, Xavier Padilla, who reviewed all chapters, gave valuable feedback, and caught many inconsistencies and calculation errors. I would also like to thank Lizzie Diaz Toledo and Sebastian Chinen, who helped with indexing and editing.

My proofreader, Erica Orloff, caught many formatting, spelling, and punctuation errors. I greatly appreciate her thorough work.

All remaining errors are mine alone.

I am also grateful to my family: to my wife, Jian, who patiently proofread the book at various stages, and to my son, Max, who tirelessly wrote Python programs for the book. These Python programs automated many tasks for the interactive and Digital Resources sections, which made it possible to integrate these sections seamlessly into the book.

Last but certainly not least, I would like to thank David Grubbs, Curtis Hill, and Robin Lloyd-Starkes from Taylor and Francis. They supported me through the whole writing process and always made themselves available when I needed help or advice.

References

Kuhn, Max. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

¹ The tidymodels package is the successor of the caret package (Kuhn (2008)). Like the caret package, it simplifies data processing and model workflow management, but tidymodels provides additional functionalities in a more systematic way.
