D.5 Metaprogramming and Formulas (*)

R (like a few other programming languages, such as Lisp and Scheme, which heavily inspired R’s semantics) allows its programmers to apply some metaprogramming techniques, that is, to write programs that manipulate unevaluated R expressions.

For instance, take a close look at the following plot:

[Figure: a plot of sin(z) against z]
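A plot of this kind can be obtained with a call like the one below. This is only a sketch: the vector has 101 elements, as stated in the text, but the range of z and the line type are our assumptions.

```r
# 101 equidistant points; the range [-2*pi, 2*pi] is an assumed choice
z <- seq(-2*pi, 2*pi, length.out = 101)
# we pass the expression sin(z) directly and set no axis labels ourselves
plot(z, sin(z), type = "l")
```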

How did the plot() function know that we are plotting sin of z? It turns out that, inside a function, we have access not only to the value of an argument (such as the result of evaluating sin(z), which is a vector of 101 reals) but also to the expression that was passed as that argument.
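One way to recover such an expression is to combine substitute(), which returns the unevaluated argument, with deparse(), which converts it to a character string. Here is a minimal sketch; the function name test_meta is made up for this illustration.

```r
test_meta <- function(x) {
    # the first line uses the value of the argument
    cat("x equals to ", x, "\n")
    # the second line uses the expression that was passed as the argument
    cat("x stemmed from ", deparse(substitute(x)), "\n")
}
test_meta(2 + 7)
```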

## x equals to  9 
## x stemmed from  2 + 7

This is very powerful and yet potentially very confusing to the users, because we can write functions that don’t evaluate the arguments provided in the way we expect them to (i.e., following the R language specification). Each function can constitute a new micro-verse with its own rules; we should always refer to the documentation.

For instance, consider the subset() function:
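The output below is consistent with calls along the following lines. Note that the threshold 7.5 is our assumption: it was chosen so that exactly the six rows listed below are selected.

```r
# the first few rows of the built-in iris data frame
print(head(iris))
# rows with large Sepal.Length; all columns except those from
# Sepal.Width to Petal.Width (the threshold 7.5 is assumed)
big <- subset(iris, Sepal.Length > 7.5, select = -(Sepal.Width:Petal.Width))
print(big)
```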

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica

Neither a row condition such as Sepal.Length>6 nor the column range -(Sepal.Width:Petal.Width) makes sense as a standalone R expression! However, according to the subset() function’s own rules, the former is treated as a row selector (here, Sepal.Length refers to a particular column within the iris data frame). The latter plays the role of a column filter (select everything but all the columns between…).
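To get a feeling for how such a function might be written, here is a simplified sketch of a row filter. The name my_filter is hypothetical, and this is not meant as the actual implementation of subset(); it only illustrates the idea of capturing a condition with substitute() and then evaluating it with eval(), using the data frame’s columns as variables.

```r
my_filter <- function(df, cond) {
    cond_expr <- substitute(cond)                # capture, do not evaluate
    keep <- eval(cond_expr, df, parent.frame())  # columns of df act as variables
    df[!is.na(keep) & keep, , drop = FALSE]      # treat missing values as FALSE
}
# the threshold 7.5 is an arbitrary choice for this illustration
res <- my_filter(iris, Sepal.Length > 7.5)
print(res[, c("Sepal.Length", "Species")])
```

Names that are not found among the columns are looked up in the caller’s environment, which is why parent.frame() is passed as the third argument to eval().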

The data.table and dplyr packages rely on this language feature all the time, so we shouldn’t be surprised when we see it in their functions.

There is one more interesting language feature that is possible thanks to metaprogramming. Formulas are special R objects that typically consist of two unevaluated R expressions separated by a tilde (~). For example:
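A formula like the one printed below can be created directly with the tilde operator. In this sketch, the variable names f and vars are our choice; note that len, supp, and dose do not have to exist at the time the formula is created.

```r
f <- len ~ supp + dose   # neither side is evaluated
print(f)
# formulas are ordinary objects that can be inspected programmatically,
# e.g., we can extract the names of the variables they mention
vars <- all.vars(f)
```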

## len ~ supp + dose

A formula on its own has no meaning. However, many R functions accept formulas as arguments and can interpret them in different ways.

For example, the lm() function, which fits a linear regression model, uses formulas to specify the output and input variables:
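The call echoed in the output below can be reproduced as follows; the variable name fit is our choice. The left side of the formula names the output variable, and the right side lists the input variables, all of which are looked up in the data frame passed via the data argument.

```r
fit <- lm(Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)
print(fit)
```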

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)
## 
## Coefficients:
##  (Intercept)  Petal.Length   Sepal.Width  
##       2.2491        0.4719        0.5955

On the other hand, boxplot() interprets a formula as a request to create separate box-and-whisker plots for each subgroup given by a combination of factors.
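As a sketch, the formula introduced earlier can be passed to boxplot(). We assume here that it refers to the built-in ToothGrowth data frame, whose columns are named len, supp, and dose; the variable name bp is our choice.

```r
# one box for each combination of supp and dose found in the data
bp <- boxplot(len ~ supp + dose, data = ToothGrowth)
# boxplot() invisibly returns the statistics it computed,
# including the labels of the subgroups
print(bp$names)
```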

[Figure: box-and-whisker plots, one for each subgroup]

The aggregate() function supports formulas too:
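The following call is one possibility that yields a table of the form shown below. We assume that the numbers are per-species arithmetic means; the variable name agg is our choice. Here, cbind() on the left side of the formula lists the columns to be summarised, and the right side gives the grouping variable.

```r
agg <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
    data = iris, FUN = mean)
print(agg)
```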

##      Species Sepal.Length Sepal.Width
## 1     setosa        5.006       3.428
## 2 versicolor        5.936       2.770
## 3  virginica        6.588       2.974

We should therefore make sure that we know how every function interprets a formula. This is described in the corresponding manual pages: ?lm, ?boxplot, ?aggregate, and so forth.