Chapter 12 Concluding Remarks

I hope the previous chapters gave you a basic but sound foundation in using machine learning models for research or for applications in your industry. Most of all, I hope the book sparked your interest in exploring machine learning further.

This leads to the question: where should you go next? What is a good strategy for further extending your machine learning knowledge? There are many answers to this question. We will focus on six strategies:

  1. Extend your knowledge of the mathematical foundations of machine learning models. One of the standard machine learning books that provides a sound mathematical introduction is An Introduction to Statistical Learning by James et al. (2023).

  2. Extend your machine learning toolkit. We used the tidymodels package as our toolkit in the previous chapters of this book. The tidymodels package offers two distinct strengths: i) it standardizes the data and modeling workflow, and ii) it provides a rich set of powerful machine learning tools. We focused on the standardization aspect that tidymodels provides. If you would like to take advantage of the analytical strengths of tidymodels, then Tidy Modeling with R by Kuhn and Silge (2022) is the right place to start your journey.

  3. Apply machine learning directly to your field of research or to a use case in your industry, such as finance or marketing. For these two fields, we recommend Tidy Finance with R by Voigt, Weiss, and Scheuch (2023) and Hands-On Data Science for Marketing by Hwang (2019), respectively.

  4. At some point, you might need to review topics from this book. The best way to find a topic you are interested in is to go to the Learning Outcomes section of the related chapter. There, you can review the learning outcomes and then revisit the section(s) you would like to read again. You can also use the AI assistant from the companion website: https://ai.lange-analytics.com/gen-ai/

  5. Take online classes. Although these can be expensive in some cases, free or lower-cost high-quality options are also available. For example, edX provides excellent online courses authored by faculty from Harvard, UC San Diego, and other top universities. One course we especially recommend is Machine Learning Fundamentals. It introduces machine learning more formally than this book, but the course is still intuitive. edX courses are free as long as you do not require a certificate.

    Another good source for online courses is DataCamp. The DataCamp learning platform provides a wide range of data science and AI courses with readings, videos, and exercises. Some courses are free, and the first chapters of all other courses are free to read. That way, you can find out whether DataCamp is for you and whether you want to subscribe for $25/month.

  6. Extend your machine learning skills with a more random strategy. As you know from machine learning (see Bootstrapping in Chapter 10), randomization can be a powerful tool for reaching a goal.

    For a more random learning strategy, signing up for Medium might be a good option. Depending on your preferences and reading history, you receive a daily email with recommended articles. If you think you do not have the time, you could read in bed in the evening and, in this way, carve the needed time out of your busy schedule.

    Read the articles without pressure but with enjoyment. If you are tired after a few pages, stop reading and sleep. Otherwise, finish the article and sleep a little later. Either way, you win: you fall asleep smoothly or you improve your skills. Signing up for Medium is free, and many articles are free to read. In addition, you can read up to three “members only” articles for free. If you would like to read more, you can consider a subscription for $50/year.
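As a quick reminder of the standardized tidymodels workflow described in strategy 2, here is a minimal sketch. It is illustrative only: the built-in mtcars data set and a plain linear regression are placeholder choices, not recommendations from this book; in practice you would substitute your own data, preprocessing steps, and model type.

```r
# Minimal tidymodels workflow sketch (illustrative; mtcars and linear_reg()
# are placeholder choices standing in for your own data and model).
library(tidymodels)

# 1. Preprocessing recipe: predict mpg, normalizing all numeric predictors
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# 2. Model specification: linear regression with the "lm" engine
model_spec <- linear_reg() |>
  set_engine("lm")

# 3. Bundle recipe and model into a workflow, then fit and predict
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(model_spec)

fitted_wf <- fit(wf, data = mtcars)
predict(fitted_wf, new_data = mtcars)
```

The point of the sketch is the structure, not the model: swapping in a random forest or a boosted tree changes only the model specification in step 2, while the recipe and workflow code stay the same.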

Whatever strategy you choose to extend your machine learning knowledge, I wish you an enjoyable journey.

Happy Analytics!

References

Aden-Buie, Garrick, Barret Schloerke, and JJ Allaire. 2022. Learnr: Interactive Tutorials for R. https://CRAN.R-project.org/package=learnr.
Asaithambi, Sudharsan. 2017. “Why, How and When to Scale Your Features.” GreyAtom.
Bank of England. 2022. “Bank of England Inflation Calculator.” Online.
Beck, Marcus W. 2018. “NeuralNetTools: Visualization and Analysis Tools for Neural Networks.” Journal of Statistical Software 85 (11): 1–20. https://doi.org/10.18637/jss.v085.i11.
Biecek, Przemyslaw. 2018. “Dalex: Explainers for Complex Predictive Models in R.” Journal of Machine Learning Research 19 (84): 1–5. https://jmlr.org/papers/v19/18-416.html.
Centers for Disease Control and Prevention (CDC). 2021a. “Vaccine Hesitancy for COVID-19: County and Local Estimates.” Online, June.
———. 2021b. “COVID-19 Vaccinations in the United States, County.” Online, September.
Chan, Chung-Hong, Geoffrey C. H. Chan, Thomas J. Leeper, and Jason Becker. 2021. Rio: A Swiss-Army Knife for Data File I/O.
Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2022. Shiny: Web Application Framework for r. https://CRAN.R-project.org/package=shiny.
Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research 16. https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, edited by Balaji Krishnapuram, 321–57. ACM, New York, NY. https://doi.org/10.1145/2939672.2939785.
Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2023. XGBoost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.
Microsoft Corporation, and Steve Weston. 2022. DoParallel: Foreach Parallel Adaptor for the ’Parallel’ Package. https://CRAN.R-project.org/package=doParallel.
Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/10.1016/j.dss.2009.05.016.
Delgado, Fernando. 2022. “A Beginner’s Guide to CatBoost with Python.” MLearning.ai, June. https://medium.com/mlearning-ai/a-beginners-guide-to-catboost-with-python-763d7e7ac199.
Delua, Julianna. 2021. “Supervised Vs. Unsupervised Learning: What’s the Difference?” IBM Blog. https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning.
Esri (Environmental Systems Research Institute). 2023. “Geoenrichment.” Online. https://www.esri.com/en-us/arcgis/products/location-services/services/geoenrichment.
Federal Reserve Bank of St. Louis. 2023. “Economic Research Resources.” Online.
Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Freund, Yoav, and Robert E. Schapire. 1996. “Experiments with a New Boosting Algorithm.” In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, 148–56. ICML’96. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Galton, Francis. 1907. “Vox Populi.” Nature 75 (1949): 450–51. https://doi.org/10.1038/075450a0.
Greenwell, Brandon M., and Bradley C. Boehmke. 2020. “Variable Importance Plots—an Introduction to the Vip Package.” The R Journal 12 (1): 343–66. https://doi.org/10.32614/RJ-2020-013.
Gujarati, D. N., and D. C. Porter. 2009. Basic Econometrics. Economics Series. McGraw-Hill, New York, NY. https://books.google.com/books?id=6l1CPgAACAAJ.
Hanck, Christoph, Martin Arnold, Alexander Gerber, and Martin Schmelzer. 2023. Introduction to Econometrics with R. Online. https://www.econometrics-with-r.org/index.html.
Haykin, Simon. 1999. Neural Networks: A Comprehensive Foundation (2nd Edition). Prentice Hall, Upper Saddle River, NJ.
Hechenbichler, K., and K. Schliep. 2004. “Weighted k-Nearest-Neighbor Techniques and Ordinal Classification.” Vol. 399. Sfb386. Institut für Statistik, Ludwig-Maximilians-Universität München. https://epub.ub.uni-muenchen.de/1769/1/paper_399.pdf.
Hendricks, Paul. 2015. Titanic: Titanic Passenger Survival Data Set. https://CRAN.R-project.org/package=titanic.
Hennig, Christian M., Fionn Murtagh, and Roberto Rocci. 2016. Handbook of Cluster Analysis. Chapman; Hall/CRC, Boca Raton, FL.
History.com. 2023. “Titanic.” Online, April. https://www.history.com/topics/early-20th-century-us/titanic.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2 (5): 359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
Hvitfeldt, Emil. 2023. Themis: Extra Recipes Steps for Dealing with Unbalanced Data.
Hvitfeldt, Emil, Thomas Lin Pedersen, and Michaël Benesty. 2022. Lime: Local Interpretable Model-Agnostic Explanations. Vignette: Understanding Lime. https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html.
Hwang, Yoon Hyup. 2019. Hands-On Data Science for Marketing: Improve Your Marketing Strategies with Machine Learning Using Python and R. Packt, Birmingham, United Kingdom.
IBM. 2021. “Telco Customer Churn.” Online.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning. Springer, New York, NY. https://doi.org/10.1007/978-3-031-38747-0.
Jensen, J. D., J. Thurman, and A. L. Vincent. 2021. Lightning Injuries. Online; Statpearls, Treasure Island, FL. https://pubmed.ncbi.nlm.nih.gov/28722949/.
Kaggle. 2015. “House Sales in King County, USA.” Online. https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.
———. 2018. “Telco Customer Churn.” Online. https://www.kaggle.com/datasets/blastchar/telco-customer-churn/metadata.
Kuhn, Max. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Kuhn, Max, and Daniel Falbel. 2022. Brulee: High-Level Modeling Functions with ’torch’. https://CRAN.R-project.org/package=brulee.
Kuhn, Max, and Julia Silge. 2022. Tidy Modeling with R: A Framework for Modeling in the Tidyverse. O’Reilly, Sebastopol, CA. https://www.tmwr.org/.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Kurama, Vihar. 2018. “A Guide to AdaBoost: Boosting to Save the Day.” Paperspace Blog, Series: Ensemble Methods.
Lange, Carsten. 2003. Neuronale Netze in der Wirtschaftswissenschaftlichen Prognose und Modellgenerierung (Neural Networks in Economic Modeling). Physica, Heidelberg, Germany.
Lange, Carsten, and Jian Lange. 2022. “Applying Machine Learning and AI Explanations to Analyze Vaccine Hesitancy.” arXiv, January.
LeCun, Yann, Corinna Cortes, and Christopher J. C. Burges. 2005. “The MNIST Database of Handwritten Digits.” Online.
Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. 2019. “Consistent Individualized Feature Attribution for Tree Ensembles.” arXiv. https://arxiv.org/abs/1802.03888.
Lundberg, Scott, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” ArXiv. https://arxiv.org/abs/1705.07874.
Iyer, Vijayasri. 2021. “Behold: The Confusion Matrix. Not a Confusing Matrix Anymore.” Medium. https://vijayasriiyer.medium.com/behold-the-confusion-matrix-10afd3feb603.
Maksymiuk, Szymon, Alicja Gosiewska, and Przemyslaw Biecek. 2020. “Landscape of R Packages for Explainable Artificial Intelligence.” arXiv. https://arxiv.org/abs/2009.13248.
Manassa, I. 2021. “Mathematics Behind Gradient Descent.” Geek Culture. https://medium.com/geekculture/mathematics-behind-gradient-descent-f2a49a0b714f.
McCulloch, Warren S., and Walter Pitts. 1943. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” The Bulletin of Mathematical Biophysics 5 (4): 115–33. https://doi.org/10.1007/bf02478259.
McDonald, John F., and Robert A. Moffitt. 1980. “The Uses of Tobit Analysis.” The Review of Economics and Statistics 62 (2): 318. https://doi.org/10.2307/1924766.
Milborrow, Stephen. 2022. Rpart.plot: Plot ’Rpart’ Models: An Enhanced Version of ’Plot.rpart’. https://CRAN.R-project.org/package=rpart.plot.
Mohajon, Joydwip. 2020. “Confusion Matrix for Your Multi-Class Machine Learning Model.” Towards Data Science. https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826.
Molnar, Christoph. 2020. Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. Second. Independently published.
Morde, Vishal. 2019. “XGBoost Algorithm: Long May She Reign!” Towards Data Science, April.
Narula, Sabhash C. 1979. “Orthogonal Polynomial Regression.” International Statistical Review/Revue Internationale de Statistique 47 (1): 31–36. http://www.jstor.org/stable/1403204.
O’Sullivan, Conor. 2022a. “KernelSHAP Vs TreeSHAP.” Towards Data Science, July. https://towardsdatascience.com/kernelshap-vs-treeshap-e00f3b3a27db.
———. 2022b. “From Shapley to SHAP — Understanding the Math.” Towards Data Science, August. https://towardsdatascience.com/from-shapley-to-shap-understanding-the-math-e7155414213b.
Park, A. et al. 2021. “Presidential Precinct Data for the 2020 General Election.” Edited by New York Times. New York Times, April. https://github.com/TheUpshot/presidential-precinct-map-2020.
Pramoditha, Rukshan. 2021. “Can LightGBM Outperform XGBoost? Boosting Algorithms in Machine Learning — Part 5.” Towards Data Science, October. https://towardsdatascience.com/can-lightgbm-outperform-xgboost-d05a94102a55.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” arXiv, February. https://doi.org/10.48550/ARXIV.1602.04938.
RStudio Team. 2022. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Sanchez, Gaston. 2021. Handling Strings with R. Leanpub, Victoria, Canada. https://www.gastonsanchez.com/r4strings/.
Schliep, Klaus, and Klaus Hechenbichler. 2016. Kknn: Weighted k-Nearest Neighbors. https://CRAN.R-project.org/package=kknn.
Shapley, L. S. 1953. “A Value for n-Person Games.” In Contributions to the Theory of Games (AM-28), Volume II, 307–18. Princeton University Press, Princeton, NJ. https://doi.org/10.1515/9781400881970-018.
Sharma, Abhishek. 2020. “4 Simple Ways to Split a Decision Tree in Machine Learning.” Analytics Vidhya, June. https://www.analyticsvidhya.com/blog/2020/06/4-ways-split-decision-tree/.
Shea, Justin M. 2023. Wooldridge: 115 Data Sets from “Introductory Econometrics: A Modern Approach, 7e” by Jeffrey M. Wooldridge. https://CRAN.R-project.org/package=wooldridge.
Singh, Himanshi. 2021. “How to Select Best Split in Decision Trees Using Gini Impurity.” Analytics Vidhya, March. https://www.analyticsvidhya.com/blog/2021/03/how-to-select-best-split-in-decision-trees-gini-impurity/.
Tay, J. Kenneth, Balasubramanian Narasimhan, and Trevor Hastie. 2023. “Elastic Net Regularization Paths for All Generalized Linear Models.” Journal of Statistical Software 106 (1): 1–31. https://doi.org/10.18637/jss.v106.i01.
Therneau, Terry, and Beth Atkinson. 2022. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
Vaughan, Davis. 2022. “Multiclass Averaging.” Online. https://yardstick.tidymodels.org/articles/multiclass.html.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. Springer, New York, NY. https://www.stats.ox.ac.uk/pub/MASS4/.
Verhulst, Pierre-François. 1845. “Recherches Mathématiques Sur La Loi D’accroissement De La Population” 18: 2013–15.
Voigt, Stefan, Patrick Weiss, and Christoph Scheuch. 2023. Tidy Finance with R. Chapman; Hall/CRC, Boca Raton, FL. https://doi.org/10.1201/b23237.
Wang, Chi-Feng. 2019. “The Vanishing Gradient Problem. Its Causes, Its Significance, and Its Solutions.” Towards Data Science. https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer, New York, NY. https://ggplot2.tidyverse.org.
———. 2019. Advanced R. Chapman; Hall/CRC, Boca Raton, FL.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly, Sebastopol, CA. https://r4ds.had.co.nz/index.html.
Wikipedia contributors. 2023a. “Camel Case.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Camel_case&oldid=1188598129.
———. 2023b. “Diversity Index.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Diversity_index&oldid=1189901595.
———. 2023c. “Feature Scaling.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Feature_scaling&oldid=1191906790.
———. 2023d. “Sigmoid Function.” Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Sigmoid_function&oldid=1187110185.
Wong, Kay Jan. 2023. “6 Types of Clustering Methods — an Overview.” Towards Data Science, March. https://towardsdatascience.com/6-types-of-clustering-methods-an-overview-7522dba026ca.
Wooldridge, Jeffrey Marc. 2020. Introductory Econometrics: A Modern Approach. Seventh. Cengage Learning, Boston, MA. http://books.google.ch/books?id=64vt5TDBNLwC.
Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.
Yıldırım, Soner. 2020. “Gradient Boosted Decision Trees-Explained.” Towards Data Science, February. https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.
