4.3 Decision Trees

4.3.1 Introduction

Note that a K-NN classifier is model-free. The whole training set must be stored and referred to at all times.

Therefore, it doesn’t explain the data we have – we may use it solely for the purpose of prediction.

Perhaps one of the most interpretable (and hence human-friendly) models consist of decision rules of the form:

IF \(x_{i,j_1}\le v_1\) AND … AND \(x_{i,j_r}\le v_r\) THEN \(\hat{y}_i=1\).

These can be organised into a hierarchy for greater readability.

This idea inspired the notion of decision trees (Breiman et al. 1984).

plot of chunk unnamed-chunk-24

Each tree node reports 3 pieces of information:

  • dominating class (0 or 1)
  • (relative) proportion of 1s represented in a node
  • (absolute) proportion of all observations in a node

plot of chunk unnamed-chunk-25

plot of chunk unnamed-chunk-26

4.3.2 Example in R

We will use the rpart() function from the rpart package to build a classification tree.

rpart() uses a formula (~) interface, hence it will be easier to feed it with data in a data.frame form.

Fit and plot a decision tree:

plot of chunk unnamed-chunk-29

The fitted model is rather… simple.

Only the alcohol variable is taken into account.

Well note how these two distributions are shifted:

plot of chunk unnamed-chunk-30

Make predictions:

##          Acc         Prec          Rec            F 
##    0.8202880    0.5447154    0.2248322    0.3182898 
##           TN           FN           FP           TP 
## 1243.0000000  231.0000000   56.0000000   67.0000000

(*) Interestingly, rpart() also provides us with information about the importance degrees of each independent variable.

##              alcohol              density            chlorides 
##         0.5664666547         0.2580265845         0.1196468644 
##        fixed.acidity     volatile.acidity            sulphates 
##         0.0187346795         0.0138674578         0.0115060641 
##       residual.sugar total.sulfur.dioxide  free.sulfur.dioxide 
##         0.0078854172         0.0036807388         0.0001855391

We can build a much more complex tree by playing with the cp parameter.

plot of chunk unnamed-chunk-33

##          Acc         Prec          Rec            F 
##    0.8159048    0.5100000    0.3422819    0.4096386 
##           TN           FN           FP           TP 
## 1201.0000000  196.0000000   98.0000000  102.0000000

4.3.3 A Note on Decision Tree Learning

Learning an optimal decision tree is a computationally hard problem – we need some heuristics.


  • ID3 (Iterative Dichotomiser 3) (Quinlan 1986)
  • C4.5 algorithm (Quinlan 1993)
  • CART by Leo Breiman et al., (Breiman et al. 1984)

(**) Decision trees are most often constructed by a greedy, top-down recursive partitioning, see., e.g., (Therneau & Atkinson 2019).