D Data Frame Wrangling in R

R data.frames are similar to matrices in the sense that we use them to store tabular data. However, in data frames each column can be of different type:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##               Country Disp Disp2 Eng.Rev Front.Hd Frt.Leg.Room
## Acura Integra   Japan  112   1.8    2935      3.5         41.5
## Acura Legend    Japan  163   2.7    2505      2.0         41.5
##               Frt.Shld Gear.Ratio Gear2  HP HP.revs Height
## Acura Integra     53.0       3.26  3.21 130    6000   47.5
## Acura Legend      55.5       2.95  3.02 160    5900   50.0
##               Length Luggage Mileage Model2 Price Rear.Hd
## Acura Integra    177      16      NA        11950     1.5
## Acura Legend     191      14      20        24760     2.0
##               Rear.Seating RearShld Reliability Rim Sratio.m
## Acura Integra         26.5     52.0 Much better R14       NA
## Acura Legend          28.5     55.5 Much better R15       NA
##               Sratio.p Steering Tank  Tires Trans1 Trans2
## Acura Integra     0.86    power 13.2 195/60  man.5 auto.4
## Acura Legend      0.96    power 18.0 205/60  man.5 auto.4
##               Turning   Type Weight Wheel.base Width
## Acura Integra      37  Small   2700        102    67
## Acura Legend       42 Medium   3265        109    69

D.1 Creating Data Frames

Most frequently, we will be creating data frames based on a series of numeric, logical, characters vectors of identical lengths.

##           u     v w
## 1 0.1815171  TRUE A
## 2 0.9197226 FALSE B
## 3 0.3117235 FALSE C
## 4 0.0641516  TRUE D
## 5 0.3964216 FALSE E

Note that when we create objects of type data frame, strings are automatically converted to factors.

## [1] "factor"

Throughout the history of computing with R, this has caused way too many bugs (recall, for instance, what’s the result of calling as.numeric() on a factor). In order to change this behaviour, either pass stringsAsFactors=FALSE argument to data.frame() or switch this feature off globally (recommended):

Some objects, such as matrices, can easily be coerced to data frames:

##      x  y  z  w
## [1,] 1  2  3  4
## [2,] 5  6  7  8
## [3,] 9 10 11 12
##   x  y  z  w
## 1 1  2  3  4
## 2 5  6  7  8
## 3 9 10 11 12

Named lists are amongst other candidates for a meaningful conversion:

## $setosa
##    min median   mean    max 
##  4.300  5.000  5.006  5.800 
## 
## $versicolor
##    min median   mean    max 
##  4.900  5.900  5.936  7.000 
## 
## $virginica
##    min median   mean    max 
##  4.900  6.500  6.588  7.900
##        setosa versicolor virginica
## min     4.300      4.900     4.900
## median  5.000      5.900     6.500
## mean    5.006      5.936     6.588
## max     5.800      7.000     7.900

D.2 Importing Data Frames

Many interesting data frames come from external sources, such as csv files, web APIs, SQL databases and so on.

In particular, read.csv() (see ?read.table for a long list of tunable parameters) imports data from plain text files organised in a tabular manner (such as comma-separated lists of values):

## "u","v","w"
## 0.181517061544582,TRUE,"A"
## 0.919722604798153,FALSE,"B"
## 0.31172346835956,FALSE,"C"
## 0.0641516039613634,TRUE,"D"
## 0.396421572659165,FALSE,"E"
##           u     v w
## 1 0.1815171  TRUE A
## 2 0.9197226 FALSE B
## 3 0.3117235 FALSE C
## 4 0.0641516  TRUE D
## 5 0.3964216 FALSE E

Note that CSV is by far the most portable format for exchanging matrix-like objects between different programs (statistical or numeric computing environments, spreadsheets etc.).

D.3 Data Frame Subsetting

D.3.1 Each Data Frame is a List

First of all, we should note that each data frame is in fact represented as an ordinary named list:

## [1] "data.frame"
## [1] "list"

Each column is stored as a separate list item. Having said that, we shouldn’t be surprised that we already know how to perform quite a few operations on data frames:

## [1] 3
## [1] "u" "v" "w"
## [1] 0.1815171 0.9197226 0.3117235 0.0641516 0.3964216
## [1]  TRUE FALSE FALSE  TRUE FALSE
##           u w
## 1 0.1815171 A
## 2 0.9197226 B
## 3 0.3117235 C
## 4 0.0641516 D
## 5 0.3964216 E
##         u         v         w 
## "numeric" "logical"  "factor"

D.3.2 Each Data Frame is Matrix-like

Data frames can be considered as “generalised” matrices. Therefore, operations such as subsetting will work in the same manner.

## [1] 5 3
##           u     v w
## 1 0.1815171  TRUE A
## 2 0.9197226 FALSE B
##           u w
## 1 0.1815171 A
## 2 0.9197226 B
## 3 0.3117235 C
## 4 0.0641516 D
## 5 0.3964216 E
## [1] 0.1815171 0.9197226 0.3117235 0.0641516 0.3964216
##           u
## 1 0.1815171
## 2 0.9197226
## 3 0.3117235
## 4 0.0641516
## 5 0.3964216

Take a special note of selecting rows based on logical vectors. For instance, let’s extract all the rows from x where the values in the column named u are between 0.3 and 0.6:

##           u     v w
## 3 0.3117235 FALSE C
## 5 0.3964216 FALSE E
##           u     v w
## 3 0.3117235 FALSE C
## 5 0.3964216 FALSE E

Moreover, subsetting based on integer vectors can be used to change the order of rows. Here is how we can sort the rows in x with respect to the values in column u:

##           u     v w
## 4 0.0641516  TRUE D
## 1 0.1815171  TRUE A
## 3 0.3117235 FALSE C
## 5 0.3964216 FALSE E
## 2 0.9197226 FALSE B

Let’s stress that the programming style we emphasise on here is very transparent. If we don’t understand how a complex operation is being executed, we can always decompose it into smaller chunks that can be studied separately. For instance, as far as the last example is concerned, we can take a look at the manual of ?order and then inspect the result of calling order(x$u).

On a side note, we can re-set the row names by referring to:

##           u     v w
## 1 0.0641516  TRUE D
## 2 0.1815171  TRUE A
## 3 0.3117235 FALSE C
## 4 0.3964216 FALSE E
## 5 0.9197226 FALSE B

D.4 Common Operations

We already know how to filter rows based on logical conditions, e.g.:

##     Petal.Width    Species
## 54          1.3 versicolor
## 56          1.3 versicolor
## 59          1.3 versicolor
## 65          1.3 versicolor
## 72          1.3 versicolor
## 74          1.2 versicolor
## 75          1.3 versicolor
## 83          1.2 versicolor
## 88          1.3 versicolor
## 89          1.3 versicolor
## 90          1.3 versicolor
## 91          1.2 versicolor
## 93          1.2 versicolor
## 95          1.3 versicolor
## 96          1.2 versicolor
## 97          1.3 versicolor
## 98          1.3 versicolor
## 100         1.3 versicolor
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 59          6.6         2.9          4.6         1.3 versicolor
## 66          6.7         3.1          4.4         1.4 versicolor
## 76          6.6         3.0          4.4         1.4 versicolor
## 77          6.8         2.8          4.8         1.4 versicolor
## 78          6.7         3.0          5.0         1.7 versicolor
## 87          6.7         3.1          4.7         1.5 versicolor

and aggregate information in individual columns:

##         Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min.        4.300000    2.000000        1.000    0.100000
## 1st Qu.     5.100000    2.800000        1.600    0.300000
## Median      5.800000    3.000000        4.350    1.300000
## Mean        5.843333    3.057333        3.758    1.199333
## 3rd Qu.     6.400000    3.300000        5.100    1.800000
## Max.        7.900000    4.400000        6.900    2.500000

Quite frequently, we will be interested in summarising data within subgroups generated by a list of factor-like variables.

##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
##     len supp dose
## 12 16.5   VC  1.0
## 10  7.0   VC  0.5
## 17 13.6   VC  1.0
## 50 27.3   OJ  1.0
## 60 23.0   OJ  2.0
##   supp dose   len
## 1   OJ  0.5 12.25
## 2   VC  0.5  7.15
## 3   OJ  1.0 23.45
## 4   VC  1.0 16.50
## 5   OJ  2.0 25.95
## 6   VC  2.0 25.95

Taking into account that split() accepts a data frame input as well, we can perform what follows:

##              setosa versicolor virginica
## Sepal.Length  5.006      5.936     6.588
## Sepal.Width   3.428      2.770     2.974
## Petal.Length  1.462      4.260     5.552
## Petal.Width   0.246      1.326     2.026
##                        setosa versicolor virginica
## Sepal.Length.Min.    4.300000   4.300000  4.300000
## Sepal.Length.1st Qu. 5.100000   5.100000  5.100000
## Sepal.Length.Median  5.800000   5.800000  5.800000
## Sepal.Length.Mean    5.843333   5.843333  5.843333
## Sepal.Length.3rd Qu. 6.400000   6.400000  6.400000
## Sepal.Length.Max.    7.900000   7.900000  7.900000
## Petal.Length.Min.    1.000000   1.000000  1.000000
## Petal.Length.1st Qu. 1.600000   1.600000  1.600000
## Petal.Length.Median  4.350000   4.350000  4.350000
## Petal.Length.Mean    3.758000   3.758000  3.758000
## Petal.Length.3rd Qu. 5.100000   5.100000  5.100000
## Petal.Length.Max.    6.900000   6.900000  6.900000

The above syntax is not super-convenient, but it only uses the building blocks that we have already mastered! That should be very appealing to the minimalists. Note that R packages such as data.table and dplyr offer more convenient substitutes – you can always learn them on your own (which takes time, but it’s worth the hassle). They simplify the most common data wrangling tasks. Moreover, they have been optimised for speed – they can handle much larger data sets efficiently.

D.5 Metaprogramming and Formulas (*)

R (together with a few other programming languages such as Lisp and Scheme, that heavily inspired R’s semantics) allows its programmers to apply some metaprogramming techniques, that is, to write programs that manipulate unevaluated R expressions.

For instance, take a close look at the following plot:

Figure 1: Metaprogramming in action: Just take a look at the Y axis label

Figure 1: Metaprogramming in action: Just take a look at the Y axis label

How did the plot() function know that we are plotting sin of z (see Figure 1)? It turns out that, at any time, we not only have access to the value of an object (such as the result of evaluating sin(z), which is a vector of 101 reals) but also to the expression that was passed as a function’s argument itself.

## x equals to  9 
## x stemmed from  2 + 7

This is very powerful and yet potentially very confusing to the users, because we can write functions that don’t compute the arguments provided in a way we expect them to (i.e., following the R language specification). Each function can constitute a new micro-verse, where with its own rules – we should always refer to the documentation.

For instance, consider the subset() function:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica

Neither Sepal.Length>6 nor -(Sepal.Width:Petal.Width) make sense as standalone R expressions! However, according to the subset() function’s own rules, the former expression is considered as a row selector (here, Sepal.Length refers to a particular column within the iris data frame). The latter plays the role of a column filter (select everything but all the columns between…).

The data.table and dplyr packages (which are very popular) rely on this language feature all the time, so we shouldn’t be surprised when we see them.

There is one more interesting language feature that is possible thanks to metaprogramming. Formulas are special R objects that consist of two unevaluated R expressions separated by a tilde (~). For example:

## len ~ supp + dose

A formula on its own has no meaning. However, many R functions accept formulas as arguments and can interpret them in various different ways.

For example, the lm() function that fits a linear regression model, uses formulas to specify the output and input variables:

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Sepal.Width, data = iris)
## 
## Coefficients:
##  (Intercept)  Petal.Length   Sepal.Width  
##       2.2491        0.4719        0.5955

On the other hand, boxplot() (see Figure 2) allows for creating separate box-and-whisker plots for each subgroup given by a combination of factors.

Figure 2: Example box plot created via the formula interface

Figure 2: Example box plot created via the formula interface

The aggregate() function supports formulas too:

##      Species Sepal.Length Sepal.Width
## 1     setosa        5.006       3.428
## 2 versicolor        5.936       2.770
## 3  virginica        6.588       2.974

We should therefore make sure that we know how every function interacts with a formula – information on that can be found in ?lm, ?boxplot, ?aggregate and so forth.

D.6 Further Reading

Recommended further reading: (Venables et al. 2020)

Other: (Peng 2019), (Wickham & Grolemund 2017)

R packages dplyr and data.table implement the most common data frame wrangling procedures. You may find them very useful. Moreover, they are very fast even for large data sets. Additionally, the magrittr package provides a pipe operator, %>%, that simplifies the writing of complex, nested function calls. Do note that not everyone is a big fan of these, however.