## 4.3 Decision Trees

### 4.3.1 Introduction

Note that a K-NN classifier is model-free. The whole training set must be stored and referred to at all times.

Therefore, it doesn’t explain the data we have – we may use it solely for the purpose of prediction.

Perhaps one of the most interpretable (and hence human-friendly) models consists of decision rules of the form:

IF $$x_{i,j_1}\le v_1$$ AND … AND $$x_{i,j_r}\le v_r$$ THEN $$\hat{y}_i=1$$.

These can be organised into a hierarchy for greater readability.
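For instance, a single rule of this kind can be evaluated directly in R. The thresholds below are purely illustrative (they were not derived from the data); only the column names come from our training set.

# a hand-written decision rule (illustrative thresholds):
# IF alcohol <= 10.5 AND volatile.acidity <= 0.3 THEN predict 1, ELSE 0
Y_rule <- ifelse(X_train[, "alcohol"] <= 10.5 &
    X_train[, "volatile.acidity"] <= 0.3, 1, 0)
mean(Y_rule == Y_train)  # fraction of training samples classified correctly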

This idea inspired the notion of decision trees (Breiman et al. 1984). Each tree node reports 3 pieces of information:

• dominating class (0 or 1)
• (relative) proportion of 1s represented in a node
• (absolute) proportion of all observations in a node

### 4.3.2 Example in R

We will use the rpart() function from the rpart package to build a classification tree.

library("rpart")
library("rpart.plot")
set.seed(123)

rpart() uses a formula (~) interface, hence it will be easiest to feed it data in data.frame form.

XY_train <- as.data.frame(cbind(X_train, Y=Y_train))
XY_test <- as.data.frame(cbind(X_test, Y=Y_test))

Fit and plot a decision tree:

t1 <- rpart(Y~., data=XY_train, method="class")
rpart.plot(t1)

The fitted model is rather… simple.

Only the alcohol variable is taken into account.
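To double-check this, we can inspect the fitted splits in textual form: print() on an rpart object lists the nodes, and rpart.rules() (available in recent versions of rpart.plot) renders each leaf as an IF-THEN rule.

print(t1)                     # textual description of the nodes and splits
rpart.plot::rpart.rules(t1)   # each leaf as an IF-THEN rule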

Well, note how the distributions of alcohol in the two classes are shifted relative to each other:

vioplot::vioplot(alcohol~Y, data=XY_train,
    horizontal=TRUE, las=1)

Make predictions:

Y_pred <- predict(t1, XY_test, type="class")
get_metrics(Y_test, Y_pred)
##          Acc         Prec          Rec            F
##    0.8202880    0.5447154    0.2248322    0.3182898
##           TN           FN           FP           TP
## 1243.0000000  231.0000000   56.0000000   67.0000000
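(get_metrics() is the helper used throughout this chapter; judging by the printed output, it can be thought of as something along the lines of the following sketch, i.e., standard quantities derived from the 2×2 confusion matrix, with class 1 treated as the positive one.)

get_metrics_sketch <- function(Y_true, Y_pred) {
    Y_pred <- as.numeric(as.character(Y_pred))  # factor levels -> 0/1 numbers
    TP <- sum(Y_pred == 1 & Y_true == 1); FP <- sum(Y_pred == 1 & Y_true == 0)
    FN <- sum(Y_pred == 0 & Y_true == 1); TN <- sum(Y_pred == 0 & Y_true == 0)
    Acc  <- (TP+TN)/(TP+TN+FP+FN)   # accuracy
    Prec <- TP/(TP+FP)              # precision
    Rec  <- TP/(TP+FN)              # recall
    F    <- 2*Prec*Rec/(Prec+Rec)   # F-measure
    c(Acc=Acc, Prec=Prec, Rec=Rec, F=F, TN=TN, FN=FN, FP=FP, TP=TP)
}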

(*) Interestingly, rpart() also provides us with information about the importance of each independent variable.

t1$variable.importance/sum(t1$variable.importance)
##              alcohol              density            chlorides
##         0.5664666547         0.2580265845         0.1196468644
##        fixed.acidity     volatile.acidity            sulphates
##         0.0187346795         0.0138674578         0.0115060641
##       residual.sugar total.sulfur.dioxide  free.sulfur.dioxide
##         0.0078854172         0.0036807388         0.0001855391
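(These normalised importances can also be visualised, e.g., with a simple bar plot; widening the left margin makes room for the variable names.)

imp <- t1$variable.importance/sum(t1$variable.importance)
par(mar=c(4, 10, 1, 1))  # wider left margin for the labels
barplot(sort(imp), horiz=TRUE, las=1, xlab="relative importance")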

We can build a much more complex tree by playing with the cp parameter.

# cp = complexity parameter, smaller → more complex tree
t2 <- rpart(Y~., data=XY_train, method="class", cp=0.007)
rpart.plot(t2, tweak=2.5, compress=FALSE)

Y_pred <- predict(t2, XY_test, type="class")
get_metrics(Y_test, Y_pred)
##          Acc         Prec          Rec            F
##    0.8159048    0.5100000    0.3422819    0.4096386
##           TN           FN           FP           TP
## 1201.0000000  196.0000000   98.0000000  102.0000000
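How can we choose cp? rpart() internally cross-validates the candidate subtrees and stores the results in the fitted object, so one common approach is to inspect this table and prune back to the value minimising the cross-validated error:

printcp(t2)   # complexity table: cross-validated error (xerror) per cp value
plotcp(t2)    # the same information, plotted
cp_best <- t2$cptable[which.min(t2$cptable[, "xerror"]), "CP"]
t3 <- prune(t2, cp=cp_best)   # prune back to the selected complexity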

### 4.3.3 A Note on Decision Tree Learning

Learning an optimal decision tree is a computationally hard problem – we need some heuristics.

Examples:

• ID3 (Iterative Dichotomiser 3) (Quinlan 1986)
• C4.5 algorithm (Quinlan 1993)
• CART (Classification and Regression Trees) (Breiman et al. 1984)

(**) Decision trees are most often constructed by greedy, top-down recursive partitioning; see, e.g., (Therneau & Atkinson 2019).
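(**) To give a flavour of the greedy approach, below is a minimal sketch (not the exact CART procedure) of how a single split on one numeric variable can be chosen: every observed value is tried as a threshold, and the one minimising the weighted Gini impurity of the two resulting subsets wins. A full tree learner repeats this search over all variables and then recurses on each subset.

gini <- function(y) {            # Gini impurity of a 0/1 label vector
    p <- mean(y == 1)
    2*p*(1-p)
}

best_split <- function(x, y) {   # x - numeric feature, y - 0/1 labels
    best <- list(value=NA, impurity=Inf)
    for (v in sort(unique(x))) {
        left  <- y[x <= v]
        right <- y[x >  v]
        if (length(left) == 0 || length(right) == 0) next
        imp <- (length(left)*gini(left) + length(right)*gini(right))/length(y)
        if (imp < best$impurity) best <- list(value=v, impurity=imp)
    }
    best
}

best_split(XY_train$alcohol, XY_train$Y)   # locally best threshold for alcohol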