B.7 Factors

Factors are special kinds of vectors that are frequently used to store qualitative data, e.g., species, groups, types. Factors are convenient in situations where we have many observations, but the number of distinct (unique) values is relatively small.

B.7.1 Creating Factors

For example, the following character vector:

##  [1] "green" "green" "green" "red"   "green" "red"   "red"  
##  [8] "red"   "green" "blue"

can be converted to a factor by calling:

##  [1] green green green red   green red   red   red   green blue 
## Levels: blue green red

Note that factors are printed differently from character vectors: the labels are not quoted, and the set of levels is listed underneath.
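The conversion described above can be sketched as follows. The variable name col is an assumption made for illustration; the name fcol is taken from the contingency table output later in this section.

```r
# `col` is an assumed name for the character vector shown above
col <- c("green", "green", "green", "red", "green",
         "red", "red", "red", "green", "blue")

# factor() finds the distinct values and, by default, sorts them to form the levels
fcol <- factor(col)
print(fcol)
```

For character data, the default level order is alphabetical, which is why blue comes first even though it appears last in the data.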

B.7.2 Levels

We can easily obtain the list of unique labels:

## [1] "blue"  "green" "red"

The labels can be replaced with new ones (re-encoded) by calling:

##  [1] g g g r g r r r g b
## Levels: b g r
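A minimal sketch of both operations, assuming the factor is stored as fcol. The calls levels() and its replacement form levels<-() are base R; the data are re-created here so that the example runs on its own.

```r
fcol <- factor(c("green", "green", "green", "red", "green",
                 "red", "red", "red", "green", "blue"))

# query the current labels
levels(fcol)  # "blue" "green" "red"

# replace the labels; the new labels are matched by position
# with the current levels, i.e., blue -> b, green -> g, red -> r
levels(fcol) <- c("b", "g", "r")
print(fcol)
```

Only the labels change. The underlying observations stay in the same order and keep the same grouping.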

To create a contingency table (in the form of a named numeric vector, giving how many values are at each factor level), we call:

## fcol
## b g r 
## 1 5 4
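The counts above can be obtained with the base R function table(). The following is a sketch that builds the relabelled factor directly; the name counts is introduced only for illustration.

```r
fcol <- factor(c("g", "g", "g", "r", "g", "r", "r", "r", "g", "b"))

# count how many observations fall into each level
counts <- table(fcol)
print(counts)

# individual counts can be extracted by label
counts["g"]
```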

B.7.3 Internal Representation (*)

Factors have the look and feel of character vectors. Internally, however, they are represented as integer sequences.

## [1] "factor"
## [1] "numeric"
##  [1] 2 2 2 3 2 3 3 3 2 1

These are always integers from 1 to M inclusive, where M is the number of levels. Their meaning is given by the levels() function: in the example above, the meaning of the codes 1, 2, 3 is, respectively, b, g, r.
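The three outputs above are consistent with calls to class(), mode(), and as.numeric(), respectively. Which functions the original chunk used is an assumption; the sketch below also shows how the codes map back to the labels.

```r
fcol <- factor(c("g", "g", "g", "r", "g", "r", "r", "r", "g", "b"))

class(fcol)       # "factor"
mode(fcol)        # "numeric"
as.numeric(fcol)  # internal codes: 2 2 2 3 2 3 3 3 2 1

# decoding by hand: use the codes as indices into the vector of levels
decoded <- levels(fcol)[as.numeric(fcol)]
print(decoded)
```

Indexing levels(fcol) with the codes yields the same strings as as.character(fcol).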

If we wished to generate a factor with a specific order of labels, we could call:

##  [1] green green green red   green red   red   red   green blue 
## Levels: red green blue

We can also assign different labels upon creation of a factor:

##  [1] g g g r g r r r g b
## Levels: r g b
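Both variants can be sketched with the levels and labels arguments of factor(). The variable names col, f1, and f2 are assumptions made for illustration.

```r
col <- c("green", "green", "green", "red", "green",
         "red", "red", "red", "green", "blue")

# fix the order of levels explicitly instead of relying on sorting
f1 <- factor(col, levels=c("red", "green", "blue"))
print(f1)

# additionally rename the levels; labels are matched with levels by position
f2 <- factor(col, levels=c("red", "green", "blue"), labels=c("r", "g", "b"))
print(f2)
```

With this level order, red is encoded as 1, green as 2, and blue as 3.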

Knowing how factors are represented is important when we deal with factors built around data that look numeric. This is because converting such a factor to numeric gives the internal codes, not the actual values:

## [1] 1 3 0 1 4 0 0 1 4
## Levels: 0 1 3 4
## [1] 2 3 1 2 4 1 1 2 4
## [1] 1 3 0 1 4 0 0 1 4
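A sketch of the pitfall and two common ways around it. The name f is an assumption, and the second workaround (indexing the converted levels) is an alternative that is not necessarily the one used to produce the output above.

```r
f <- factor(c(1, 3, 0, 1, 4, 0, 0, 1, 4))
print(f)

# direct conversion returns the internal codes
as.numeric(f)                # 2 3 1 2 4 1 1 2 4

# going through character strings recovers the original values
as.numeric(as.character(f))  # 1 3 0 1 4 0 0 1 4

# alternative: convert the (few) levels once, then index them by the factor
as.numeric(levels(f))[f]     # 1 3 0 1 4 0 0 1 4
```

The last form converts only the distinct levels rather than every observation, which can matter for long vectors.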

Moreover, the integer encoding saves work in contexts such as plotting data that are grouped into different classes. For instance, here is a scatter plot for the Sepal.Length and Petal.Width variables in the iris dataset (which is an object of type data.frame, see below). Flowers are of different Species, and we wish to indicate which point belongs to which class:

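[Figure: scatter plot of the Sepal.Length and Petal.Width variables in the iris dataset, with the plotting symbol and colour of each point determined by Species.]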

The above was possible because the Species column is a factor object with:

## [1] "setosa"     "versicolor" "virginica"

and the meaning of pch of 1, 2, 3, … is “circle”, “triangle”, “plus”, …, respectively. What’s more, there’s a default palette that maps consecutive integers to different colours:

## [1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta"
## [7] "yellow"  "gray"

Hence, black circles mark irises from the 1st class, i.e., “setosa”.
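A plot of this kind can be sketched as below. The choice of axes, the axis labels, and the name codes are assumptions; the key step is passing the integer codes of Species as both pch and col.

```r
# integer codes of the Species factor: 1 = setosa, 2 = versicolor, 3 = virginica
codes <- as.numeric(iris$Species)

plot(iris$Sepal.Length, iris$Petal.Width,
     pch=codes,   # plotting symbol chosen by class
     col=codes,   # colour taken from the current palette() by class
     xlab="Sepal.Length", ylab="Petal.Width")

# inspect the colours that the integers 1, 2, ... currently map to
palette()
```

The colour names returned by palette() depend on the R version and on any user changes to the palette, so they may differ from the listing above.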