This is a slightly older version of the forthcoming textbook (ETA 2022), preliminarily entitled Machine Learning in R from Scratch, by Marek Gagolewski. It is distributed in the hope that it will be useful. The book is now undergoing a major revision (when I am not busy with other projects). Not much work will be going on in this repository anymore, as its sources have moved elsewhere; however, if you happen to find any bugs or typos, please drop me an email. I will share a new draft once it’s ripe. Stay tuned.

Copyright (C) 2020-2021, Marek Gagolewski.

This material is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

You can access this book at:

Aims and Scope

Machine learning has numerous exciting real-world applications, including stock market prediction, speech recognition, computer-aided medical diagnosis, content and product recommendation, anomaly detection in security camera footage, game playing, autonomous vehicle operation, and many others.

In this book we will take an unpretentious glance at the most fundamental algorithms that have stood the test of time and which form the basis for state-of-the-art solutions of modern AI, which is principally (big) data-driven. We will learn how to use the R language (R Development Core Team 2021) for implementing various stages of data processing and modelling activities. For a more in-depth treatment of R, refer to this book’s Appendices and, for instance, (Wickham & Grolemund 2017, Peng 2019, Venables et al. 2021).

These pages contain solid underpinnings for further studies related to statistical learning, machine learning, data science, data analytics, and artificial intelligence, including (Bishop 2006, Hastie et al. 2017, James et al. 2017). We will also appreciate the vital role of mathematics as a universal language for formalising data-intense problems and communicating their solutions. The book is aimed at readers who are not yet fluent in university-level linear algebra, calculus, and probability theory, such as first-year undergrads or those who have forgotten all the maths they have learned and need a gentle, non-invasive, yet rigorous introduction to the topic. For a nice, machine learning-focused introduction to mathematics alone, see, e.g., (Deisenroth et al. 2020).

About the Author

Marek Gagolewski (pronounced like Mark Gaggle-Eve-Ski) is currently a Senior Lecturer in Applied AI at Deakin University in Melbourne, VIC, Australia and an Associate Professor in Data Science (on long-term leave) at the Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland.

He is actively involved in developing usable free (libre) and open source software, with a particular focus on data science and machine learning. He is the main author and maintainer of stringi, one of the most frequently downloaded R packages (with over 33,000,000 downloads), which aims at natural language and string processing, as well as of the Python and R package genieclust, which implements the fast and robust hierarchical clustering algorithm Genie with noise point detection.

He’s an author of more than 75 publications on machine learning and optimisation algorithms, data aggregation and clustering, statistical modelling, and scientific computing. Explaining things matters to him more than merely tuning the knobs so as to increase a chosen performance metric (with uncontrollable consequences for the other ones); the latter belongs to technology and wizardry, not science.

Moreover, Marek has taught various courses related to R and Python programming, algorithms, data science, and machine learning in Australia, Poland, and Germany (e.g., at Data Science Retreat).


This book has been prepared with pandoc, Markdown, and GitBook. R code chunks have been processed with knitr. A little help from bookdown, good ol’ Makefiles, and shell scripts did the trick.

The following R packages are used or referred to in the text: bookdown, Cairo, DEoptim, fastcluster, FNN, genie, genieclust, gsl, hydroPSO, ISLR, keras, knitr, Matrix, microbenchmark, pdist, RColorBrewer, recommenderlab, rpart, rpart.plot, rworldmap, scatterplot3d, stringi, tensorflow, tidyr, titanic, vioplot.

During the writing of this book, I’ve been listening to music featuring John Coltrane, Krzysztof Komeda, Henry Threadgill, Albert Ayler, Paco de Lucia, and Tomatito.


Bishop C (2006) Pattern recognition and machine learning. Springer-Verlag

Deisenroth MP, Faisal AA, Ong CS (2020) Mathematics for machine learning. Cambridge University Press

Hastie T, Tibshirani R, Friedman J (2017) The elements of statistical learning. Springer-Verlag

James G, Witten D, Hastie T, Tibshirani R (2017) An introduction to statistical learning with applications in R. Springer-Verlag

Peng RD (2019) R programming for data science.

R Development Core Team (2021) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria

Venables WN, Smith DM, R Core Team (2021) An introduction to R.

Wickham H, Grolemund G (2017) R for data science. O’Reilly