D.4 Common Operations

We already know how to filter rows based on logical conditions, e.g.:

iris[iris$Sepal.Length > 6.5 & iris$Species == "versicolor", ]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 59          6.6         2.9          4.6         1.3 versicolor
## 66          6.7         3.1          4.4         1.4 versicolor
## 76          6.6         3.0          4.4         1.4 versicolor
## 77          6.8         2.8          4.8         1.4 versicolor
## 78          6.7         3.0          5.0         1.7 versicolor
## 87          6.7         3.1          4.7         1.5 versicolor

and aggregate information in individual columns:

sapply(iris[1:4], summary)
##         Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min.        4.300000    2.000000        1.000    0.100000
## 1st Qu.     5.100000    2.800000        1.600    0.300000
## Median      5.800000    3.000000        4.350    1.300000
## Mean        5.843333    3.057333        3.758    1.199333
## 3rd Qu.     6.400000    3.300000        5.100    1.800000
## Max.        7.900000    4.400000        6.900    2.500000

Quite frequently, we will be interested in summarising data within subgroups generated by a list of factor-like variables.

aggregate(iris[1:4], iris[5], mean)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
ToothGrowth[sample(nrow(ToothGrowth), 5), ] # 5 random rows
##     len supp dose
## 12 16.5   VC  1.0
## 10  7.0   VC  0.5
## 17 13.6   VC  1.0
## 50 27.3   OJ  1.0
## 60 23.0   OJ  2.0
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], median)
##   supp dose   len
## 1   OJ  0.5 12.25
## 2   VC  0.5  7.15
## 3   OJ  1.0 23.45
## 4   VC  1.0 16.50
## 5   OJ  2.0 25.95
## 6   VC  2.0 25.95

Taking into account that split() accepts a data frame input as well, we can perform what follows:

sapply(
# split iris into 3 sub-data.frames:
split(iris, iris[5]),
# on each sub-data.frame, apply the following function
function(df) {
# compute the mean of first four columns:
sapply(df[1:4], mean)
})
##              setosa versicolor virginica
## Sepal.Length  5.006      5.936     6.588
## Sepal.Width   3.428      2.770     2.974
## Petal.Length  1.462      4.260     5.552
## Petal.Width   0.246      1.326     2.026
sapply(split(iris, iris[5]), function(df) {
c(Sepal.Length=summary(iris$Sepal.Length), Petal.Length=summary(iris$Petal.Length)
)
})
##                        setosa versicolor virginica
## Sepal.Length.Min.    4.300000   4.300000  4.300000
## Sepal.Length.1st Qu. 5.100000   5.100000  5.100000
## Sepal.Length.Median  5.800000   5.800000  5.800000
## Sepal.Length.Mean    5.843333   5.843333  5.843333
## Sepal.Length.3rd Qu. 6.400000   6.400000  6.400000
## Sepal.Length.Max.    7.900000   7.900000  7.900000
## Petal.Length.Min.    1.000000   1.000000  1.000000
## Petal.Length.1st Qu. 1.600000   1.600000  1.600000
## Petal.Length.Median  4.350000   4.350000  4.350000
## Petal.Length.Mean    3.758000   3.758000  3.758000
## Petal.Length.3rd Qu. 5.100000   5.100000  5.100000
## Petal.Length.Max.    6.900000   6.900000  6.900000

The above syntax is not super-convenient, but it only uses the building blocks that we have already mastered! That should be very appealing to the minimalists. Note that R packages such as data.table and dplyr offer more convenient substitutes – you can always learn them on your own (which takes time, but it’s worth the hassle). They simplify the most common data wrangling tasks. Moreover, they have been optimised for speed – they can handle much larger data sets efficiently.