# B Vector Algebra in R

This chapter is a step-by-step guide to vector computations in R. It also explains the basic mathematical notation around vectors.

You’re encouraged not only to *read* the chapter
but also to run the R code provided yourself.
Play with it, do some experiments, get curious about
how R works. Read the documentation on the functions you are calling,
e.g., `?seq`, `?sample`, and so on.

Technical and mathematical literature isn’t belletristic.
It requires *active* (*pro-active* even) thinking.
Sometimes going through a single page can take an hour. Or a day.
If you don’t understand something, keep thinking, go back, ask yourself
questions, take a look at other sources. This is not a *linear* process.
This is what makes it fun and creative.
To become a good programmer you need a lot of practice; there
are no shortcuts. But the whole endeavour is worth the hassle!

## B.1 Motivation

Vector and matrix algebra provides us with a convenient language for expressing computations on sequential and tabular data.

Vector and matrix algebra operations are supported by every major programming language – either natively (e.g., R, Matlab, GNU Octave, Mathematica) or via an additional library/package (e.g., Python with numpy, tensorflow or pytorch; C++ with Eigen/Armadillo; C, C++ or Fortran with LAPACK).

By using matrix notation, we generate more concise and readable code.

For instance, given two vectors \(\boldsymbol{x}=(x_1,\dots,x_n)\) and \(\boldsymbol{y}=(y_1,\dots,y_n)\) like:

Instead of writing:

`## [1] 9.11592`

to mean:

\[ \sqrt{ (x_1-y_1)^2 + (x_2-y_2)^2 + \dots + (x_n-y_n)^2 } = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}, \]

which denotes the (Euclidean) distance between the two vectors (the square root of the sum of squared differences between the corresponding elements in \(\boldsymbol{x}\) and \(\boldsymbol{y}\)), we shall soon become used to writing:

`## [1] 9.11592`

or:

\[ \sqrt{(\boldsymbol{x}-\boldsymbol{y})^{T}(\boldsymbol{x}-\boldsymbol{y})} \]

or even:

\[ \|\boldsymbol{x}-\boldsymbol{y}\|_2 \]

In order to be able to read this notation, we only have to get to know the most common “building blocks”. There are just a few of them, but it’ll take some time until we become comfortable with their use.
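As a rough illustration of the notations above, the following sketch computes the same Euclidean distance in three ways. The vectors `x` and `y` are made-up values chosen only to give a round result; they are an assumption, not the data behind the output shown earlier.

```r
# hypothetical example vectors (an assumption, not the chapter's data)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# "iterative" style: accumulate the squared differences one by one
s <- 0
for (i in seq_along(x))
    s <- s + (x[i] - y[i])^2
d_loop <- sqrt(s)

# "vectorised" style: square root of the sum of squared differences
d_vec <- sqrt(sum((x - y)^2))

# matrix notation: sqrt((x-y)^T (x-y)); drop() turns the 1x1 matrix into a number
d_mat <- sqrt(drop(t(x - y) %*% (x - y)))

print(c(d_loop, d_vec, d_mat))
```

All three values should agree; the vectorised form is the one closest to the mathematical notation.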

What’s more, we should note that vectorised code might be
much faster than the `for` loop-based one (a.k.a. “iterative” style):

```
library("microbenchmark")
n <- 10000
x <- runif(n) # n random numbers in [0,1]
y <- runif(n)
print(microbenchmark(
    t1={
        # "iterative" style
        s <- 0
        n <- length(x)
        for (i in 1:n)
            s <- s + (x[i]-y[i])^2
        sqrt(s)
    },
    t2={
        # "vectorised" style
        sqrt(sum((x-y)^2))
    }
), signif=3, unit='relative')
```

```
## Unit: relative
## expr min lq mean median uq max neval
## t1 127 123 102 117 93.9 91.7 100
## t2 1 1 1 1 1.0 1.0 100
```

## B.2 Numeric Vectors

### B.2.1 Creating Numeric Vectors

First let’s introduce a few ways with which we can create numeric vectors.

#### B.2.1.1 `c()`

The `c()` function *c*ombines a given list of values to form a sequence:

`## [1] 1 2 3`

`## [1] 1 2 3 4 5 6 7 8`

Note that when we use the assignment operator, `<-` or `=` (both are
equivalent), printing of the output is suppressed:

`## [1] 1 2 3`

However, we can enforce it by parenthesising the whole expression:

`## [1] 1 2 3`

In order to determine that `x` is indeed a numeric vector,
we call:

`## [1] "numeric"`

`## [1] "numeric"`

- Remark.
These two functions might return different results. For instance, in the next chapter we note that a numeric matrix will yield `mode()` of `numeric` and `class()` of `matrix`.

What is more, we can get the number of elements in `x` by calling:

`## [1] 3`
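The outputs in this subsection can be obtained with calls along the following lines. This is only a sketch: the exact arguments are an assumption, and `mode()` and `class()` are taken to be the two functions the remark above refers to.

```r
c(1, 2, 3)            # combine three values
c(1:5, 6, c(7, 8))    # the arguments may themselves be vectors

x <- c(1, 2, 3)       # assignment: nothing is printed
(x <- c(1, 2, 3))     # parenthesised assignment: the value is printed

mode(x)               # "numeric"
class(x)              # "numeric"
length(x)             # 3
```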

#### B.2.1.2 `seq()`

To create an arithmetic progression,
i.e., a sequence of equally-spaced numbers,
we can call the `seq()` function:

`## [1] 1 3 5 7 9`

If we access the function’s documentation (by executing `?seq` in the console),
we’ll note that the function takes a couple of parameters:
`from`, `to`, `by`, `length.out` etc.

The above call is equivalent to:

`## [1] 1 3 5 7 9`

The `by` argument can be replaced with `length.out`, which gives the desired
size:

`## [1] 0.00 0.25 0.50 0.75 1.00`

Note that R supports partial matching of argument names:

`## [1] 0.00 0.25 0.50 0.75 1.00`

Quite often we need progressions with step equal to 1 or -1.
Such vectors can be generated by applying the `:` operator.

`## [1] 1 2 3 4 5 6 7 8 9 10`

`## [1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10`
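A minimal sketch of calls that give sequences like the ones shown in this subsection (the argument values are inferred from the printed outputs, so treat them as an assumption):

```r
seq(1, 9, 2)                # from 1 to 9 with step 2
seq(from=1, to=9, by=2)     # the same, with named arguments
seq(0, 1, length.out=5)     # 5 equally-spaced values in [0, 1]
seq(0, 1, len=5)            # `len` is partially matched to `length.out`
1:10                        # step of 1
-1:-10                      # step of -1; unary minus binds tighter than `:`
```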

#### B.2.1.3 `rep()`

Moreover, `rep()` replicates a given vector.
Check out the function’s documentation (see `?rep`) for
the meaning of the arguments provided below.

`## [1] 1 1 1 1 1`

`## [1] 1 2 3 1 2 3 1 2 3 1 2 3`

`## [1] 1 1 2 2 2 2 3 3 3`

`## [1] 1 1 1 1 2 2 2 2 3 3 3 3`
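The four outputs above are consistent with the calls sketched below; the concrete arguments are an assumption reconstructed from the printed results.

```r
rep(1, 5)               # repeat a single value 5 times
rep(1:3, 4)             # repeat the whole vector 4 times
rep(1:3, c(2, 4, 3))    # repeat the i-th element times[i] times
rep(1:3, each=4)        # repeat each element 4 times in a row
```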

#### B.2.1.4 Pseudo-Random Vectors

We can also generate vectors of pseudo-random values. For instance, the following generates 5 deviates from the uniform distribution (every number has the same probability) on the unit (i.e., \([0,1]\)) interval:

`## [1] 0.5649012 0.5588051 0.4414829 0.2076393 0.6696355`

We call such numbers pseudo-random, because they are generated arithmetically.
In fact, by setting the random number generator’s state (also called the *seed*),
we can obtain *reproducible* results.

`## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673`

`## [1] 0.0455565 0.5281055 0.8924190 0.5514350 0.4566147`

`## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673`
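Reproducibility can be checked with a sketch like the one below. The seed value `123` is an arbitrary choice made for this illustration; any fixed integer would serve the same purpose.

```r
set.seed(123)       # fix the generator's state (123 is an arbitrary choice)
a <- runif(5)
b <- runif(5)       # the state has advanced, so these values differ from `a`
set.seed(123)       # restore the same state...
a2 <- runif(5)      # ...and obtain the same values as in `a`

print(a)
print(b)
print(a2)
identical(a, a2)    # TRUE
```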

Note the difference between the uniform distribution on \([0,1]\) and the normal distribution with expected value of \(0\) and standard deviation of \(1\) (also called the standard normal distribution), see Figure 1.

```
par(mfrow=c(1, 2)) # align plots in one row and two columns
hist(runif(10000, 0, 1), col="white", ylim=c(0, 2500)); box()
hist(rnorm(10000, 0, 1), col="white", ylim=c(0, 2500)); box()
```

Another useful function samples a number of values from a given vector, either with or without replacement:

`## [1] 3 3 10 2 6 5 4 6`

`## [1] 9 5 3 8 1 4 6 10`

Note that if `n` is a single number,
`sample(n, ...)` is equivalent to `sample(1:n, ...)`.
This is a dangerous behaviour that may lead to bugs in our code.
Read more at `?sample`.
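The sketch below shows sampling with and without replacement, and the pitfall just mentioned. The helper `pick()` is a hypothetical name introduced here for illustration; it samples positions rather than values, which is one way to sidestep the special case.

```r
sample(1:10, 8, replace=TRUE)   # with replacement: repeats are possible
sample(1:10, 8)                 # without replacement: all values distinct

v <- c(10)                      # a vector that happens to have length one
sample(v, 3)                    # draws from 1:10, not from the set {10}!

# hypothetical helper: sample indices, then subset
pick <- function(v, size) v[sample.int(length(v), size)]
pick(v, 1)                      # always 10
```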

### B.2.2 Vector-Scalar Operations

Mathematically, we sometimes refer to a vector that is reduced to a single component
as a *scalar*. We are used to denoting such objects with lowercase letters
such as \(a, b, i, s, x\in\mathbb{R}\).

- Remark.
Note that some programming languages distinguish between atomic numerical entities and length-one vectors, e.g., `7` vs. `[7]` in Python. This is not the case in R, where `length(7)` returns 1.

Vector-scalar arithmetic operations such as \(s\boldsymbol{x}\) (multiplication of a vector \(\boldsymbol{x}=(x_1,\dots, x_n)\) by a scalar \(s\)) result in a vector \(\boldsymbol{y}\) such that \(y_i=s x_i\), \(i=1,\dots,n\).

The same rule holds for, e.g., \(s+\boldsymbol{x}\), \(\boldsymbol{x}-s\), \(\boldsymbol{x}/s\).

`## [1] 0.5 5.0 50.0`

`## [1] 11 12 13 14 15`

`## [1] 0.0 0.2 0.4 0.6 0.8 1.0`

By \(-\boldsymbol{x}\) we will mean \((-1)\boldsymbol{x}\):

`## [1] 0.00 -0.25 -0.50 -0.75 -1.00`

Note that in R the same rule applies for exponentiation:

`## [1] 0 1 4 9 16 25`

`## [1] 1 2 4 8 16 32`

However, in mathematics, we are **not** used to writing
\(2^{\boldsymbol{x}}\) or \(\boldsymbol{x}^2\).
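A short sketch of vector-scalar operations in R follows. The operands are assumptions picked to be consistent with the outputs printed above; other inputs could produce the same results.

```r
c(0.05, 0.5, 5) * 10    # each element multiplied by 10
10 + 1:5                # 10 added to each element
(0:5) / 5               # each element divided by 5
-seq(0, 1, 0.25)        # the same as (-1) * seq(0, 1, 0.25)
(0:5)^2                 # each element squared
2^(0:5)                 # 2 raised to each element
```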

### B.2.3 Vector-Vector Operations

Let \(\boldsymbol{x}=(x_1,\dots,x_n)\) and \(\boldsymbol{y}=(y_1,\dots,y_n)\) be two vectors of identical lengths.

Arithmetic operations \(\boldsymbol{x}+\boldsymbol{y}\) and \(\boldsymbol{x}-\boldsymbol{y}\) are performed *elementwise*,
i.e., they result in a vector \(\boldsymbol{z}\) such that
\(z_i=x_i+y_i\) and \(z_i=x_i-y_i\), respectively, \(i=1,\dots,n\).

`## [1] 2 12 103 1004`

`## [1] 0 -8 -97 -996`

Although in mathematics we are **not** used to using any special notation
for elementwise multiplication, division and exponentiation, this is available in R.

`## [1] 1 20 300 4000`

`## [1] 1.000 0.200 0.030 0.004`

`## [1] 1e+00 1e+02 1e+06 1e+12`

- Remark.
`1e+12` is a number written in the *scientific notation*. It means “1 times 10 to the power of 12”, i.e., \(1\times 10^{12}\). Physicists love this notation, because they are used to dealing with very small (think sizes of quarks) and very large (think distances between galaxies) entities.

Moreover, in R the **recycling rule** is applied if we perform elementwise
operations on vectors of *different* lengths – the shorter
vector is recycled as many times as needed to match the length of the longer
vector, just as if we were performing:

`## [1] 1 2 3 1 2 3 1 2 3 1 2 3`

Therefore:

`## [1] 1 2 3 4 5 6`

`## [1] 1 20 3 40 5 60`

`## [1] 1 20 300 4 50 600`

```
## Warning in 1:6 * c(1, 10, 100, 1000): longer object length is
## not a multiple of shorter object length
```

`## [1] 1 20 300 4000 5 60`

Note that a warning is not an error – we still get a result that makes sense.
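The elementwise operations and the recycling rule from this subsection can be tried out as follows. The vectors are assumptions reconstructed from the printed outputs.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 10, 100, 1000)
x + y      # elementwise addition
x - y      # elementwise subtraction
x * y      # elementwise multiplication
x / y      # elementwise division
y ^ x      # elementwise exponentiation

rep(1:3, length.out=12)      # how a length-3 vector is recycled to length 12
1:6 * c(1)                   # 1 is recycled 6 times
1:6 * c(1, 10)               # (1, 10) is recycled 3 times
1:6 * c(1, 10, 100)          # (1, 10, 100) is recycled twice
1:6 * c(1, 10, 100, 1000)    # 6 is not a multiple of 4: result plus a warning
```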

### B.2.4 Aggregation Functions

R implements a couple of *aggregation* functions:

- `sum(x)` = \(\sum_{i=1}^n x_i=x_1+x_2+\dots+x_n\)
- `prod(x)` = \(\prod_{i=1}^n x_i=x_1 x_2 \dots x_n\)
- `mean(x)` = \(\frac{1}{n}\sum_{i=1}^n x_i\) – arithmetic mean
- `var(x)` = `sum((x-mean(x))^2)/(length(x)-1)` = \(\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \frac{1}{n}\sum_{j=1}^n x_j \right)^2\) – variance
- `sd(x)` = `sqrt(var(x))` – standard deviation

See also: `min()`, `max()`, `median()`, `quantile()`.

- Remark.
Remember that you can always access the R manual by typing `?functionname`, e.g., `?quantile`.
- Remark.
Note that \(\sum_{i=1}^n x_i\) can also be written as \(\displaystyle\sum_{i=1}^n x_i\) or even \(\displaystyle\sum_{i=1,\dots,n} x_i\). These all mean the sum of \(x_i\) for \(i\) from \(1\) to \(n\), that is, the sum of \(x_1\), \(x_2\), …, \(x_n\), i.e., \(x_1+x_2+\dots+x_n\).

`## [1] 0.4972778`

`## [1] 0.4899503`

`## [1] 0.0004653491`

`## [1] 0.9994045`
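To check the formulas for the aggregation functions by hand, here is a small sketch on a made-up vector (an assumption; it is not the data behind the numbers printed above):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical example data
n <- length(x)

sum(x)                           # x_1 + x_2 + ... + x_n
prod(x)                          # x_1 * x_2 * ... * x_n
mean(x)                          # the same as sum(x)/n

v <- sum((x - mean(x))^2) / (n - 1)   # variance computed from the formula
c(v, var(x))                          # both values agree
c(sqrt(v), sd(x))                     # and so do the standard deviations
```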

### B.2.5 Special Functions

Furthermore, R supports numerous mathematical functions, e.g.,
`sqrt()`, `abs()`, `round()`, `log()`, `exp()`, `cos()`, `sin()`.
All of them are vectorised – when applied on a vector of length \(n\),
they yield a vector of length \(n\) in result.

For example, here is how we can compute the square roots of all the integers between 1 and 9:

```
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
## [7] 2.645751 2.828427 3.000000
```

Vectorisation is super-convenient when it comes to, for instance, plotting (see Figure 2).

```
x <- seq(-2*pi, 6*pi, length.out=51)
plot(x, sin(x), type="l")
lines(x, cos(x), col="red") # add a curve to the current plot
```

**Exercise.**

Try increasing the `length.out` argument to make the curves smoother.

### B.2.6 Norms and Distances

Norms are used to measure the *size* of an object.
We will be interested in the following norms:

- Euclidean norm:
\[
\|\boldsymbol{x}\| = \|\boldsymbol{x}\|_2 = \sqrt{ \sum_{i=1}^n x_i^2 }
\]
this is nothing else than the *length* of the vector \(\boldsymbol{x}\)
- Manhattan (taxicab) norm:
\[ \|\boldsymbol{x}\|_1 = \sum_{i=1}^n |x_i| \]
- Chebyshev (maximum) norm:
\[ \|\boldsymbol{x}\|_\infty = \max_{i=1,\dots,n} |x_i| = \max\{ |x_1|, |x_2|, \dots, |x_n| \} \]

The above norms can be easily implemented by means of the building blocks introduced above:

`## [1] 2.236068`

`## [1] 3`

`## [1] 2`

Also note that all the norms easily generate the corresponding
*distances* (metrics) between two given points. In particular:

\[ \| \boldsymbol{x}-\boldsymbol{y} \| = \sqrt{ \sum_{i=1}^n \left(x_i-y_i\right)^2 } \]

gives the *Euclidean distance* (metric) between the two vectors.

`## [1] 1`

This is the “normal” distance that you learned at school.
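A minimal sketch of the three norms and of the Euclidean distance is given below. The vectors `z`, `u` and `v` are assumptions: they are merely one choice of inputs consistent with the values printed in this subsection.

```r
z <- c(1, -2)        # hypothetical input vector
sqrt(sum(z^2))       # Euclidean norm
sum(abs(z))          # Manhattan norm
max(abs(z))          # Chebyshev norm

u <- c(1, 0)         # hypothetical points
v <- c(1, 1)
sqrt(sum((u - v)^2)) # Euclidean distance = Euclidean norm of u-v
```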

### B.2.7 Dot Product (*)

What is more, given two vectors of identical lengths,
\(\boldsymbol{x}\) and \(\boldsymbol{y}\),
we define their *dot product* (a.k.a. *scalar* or *inner product*) as:

\[ \boldsymbol{x}\cdot\boldsymbol{y} = \sum_{i=1}^n x_i y_i. \]

Let’s stress that this is not the same as the elementwise vector multiplication in R – the result is a single number.

`## [1] 1`

- Remark.
(*) Note that the squared Euclidean norm of a vector is equal to the dot product of the vector and itself, \(\|\boldsymbol{x}\|^2 = \boldsymbol{x}\cdot\boldsymbol{x}\).
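In R, the dot product can be sketched with the building blocks already introduced. The vectors `u` and `v` below are assumed values, chosen to be consistent with the output above.

```r
u <- c(1, 0)        # hypothetical vectors
v <- c(1, 1)

u * v               # elementwise product: a vector
sum(u * v)          # dot product: a single number
drop(u %*% v)       # the same via matrix multiplication

# the squared Euclidean norm equals the dot product of a vector with itself
c(sqrt(sum(v^2))^2, sum(v * v))
```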

(*) Interestingly, a dot product has a nice geometrical interpretation:
\[
\boldsymbol{x}\cdot\boldsymbol{y} = \|\boldsymbol{x}\| \|\boldsymbol{y}\|
\cos\alpha
\]
where \(\alpha\) is the angle between the two vectors.
In other words, it is the product of the lengths of the two vectors
and the cosine of the angle between them.
Note that we can get the cosine part by computing the dot product
of the *normalised*
vectors, i.e., such that their lengths are equal to 1.

For example, the two vectors `u` and `v` defined above
can be depicted as in Figure 3.

We can compute the angle between them by calling:

`## [1] 1`

`## [1] 1.414214`

`## [1] 0.7071068`

`## [1] 45`
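The four values above (two lengths, the cosine and the angle in degrees) can be reproduced with a sketch such as the following, assuming `u = (1, 0)` and `v = (1, 1)`; the variable names are introduced here for illustration only.

```r
u <- c(1, 0)                       # assumed vectors
v <- c(1, 1)

len_u <- sqrt(sum(u^2))            # length of u
len_v <- sqrt(sum(v^2))            # length of v
cos_uv <- sum(u * v) / (len_u * len_v)   # dot product of the normalised vectors
angle <- acos(cos_uv) * 180 / pi   # convert radians to degrees

print(c(len_u, len_v, cos_uv, angle))
```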

### B.2.8 Missing and Other Special Values

R has a notion of a missing (not-available) value. It is very useful in data analysis, as we sometimes don’t have information on an object’s feature. For instance, we might not know a patient’s age if they were admitted to the hospital unconscious.

Operations on missing values generally result in missing values – that makes a lot of sense.

`## [1] 12 14 NA 18 20`

`## [1] NA`

If we wish to compute a vector’s aggregate after all,
we can get rid of the missing values by calling `na.omit()`:

`## [1] 3`

We can also make use of the `na.rm` parameter of the `mean()` function
(however, not every aggregation function has it – always refer to documentation).

`## [1] 3`
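A sketch of how missing values propagate, and of the two ways to skip them, might look as follows. The vector `x` is an assumption consistent with the outputs above.

```r
x <- c(1, 2, NA, 4, 5)   # hypothetical vector with one missing value

2 * x + 10               # arithmetic on NA gives NA at that position
mean(x)                  # NA: the mean of something unknown is unknown
mean(na.omit(x))         # drop the NAs first, then aggregate
mean(x, na.rm=TRUE)      # or ask mean() to remove them
```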

- Remark.
Note that in R, a dot has no special meaning. `na.omit` is as good a function name or variable identifier as `na_omit`, `naOmit`, `NAOMIT`, `naomit` and `NaOmit`. Note, however, that R is a case-sensitive language – these are all different symbols. Read more in the *Details* section of `?make.names`.

Moreover, some arithmetic operations can result in infinities (\(\pm \infty\)):

`## [1] -Inf`

`## [1] Inf`

Also, sometimes we’ll get a *not-a-number*, `NaN`. This is not a missing value,
but an “invalid” result.

`## Warning in sqrt(-1): NaNs produced`

`## [1] NaN`

`## Warning in log(-1): NaNs produced`

`## [1] NaN`

`## [1] NaN`
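The expressions below are one plausible set of inputs yielding infinities and `NaN`s like those above; apart from `sqrt(-1)` and `log(-1)`, which appear in the warnings, they are assumptions.

```r
log(0)       # -Inf
10^1000      # Inf: the result is too large to be represented
sqrt(-1)     # NaN, with a warning
log(-1)      # NaN, with a warning
Inf - Inf    # NaN: an indeterminate form
```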

## B.3 Logical Vectors

### B.3.1 Creating Logical Vectors

In R there are 3 (!) logical values:
`TRUE`, `FALSE` and geez, I don’t know, `NA` maybe?

`## [1] TRUE FALSE TRUE NA FALSE FALSE TRUE`

`## [1] TRUE FALSE NA TRUE FALSE NA`

`## [1] "logical"`

`## [1] "logical"`

`## [1] 6`

- Remark.
By default, `T` is a synonym for `TRUE` and `F` for `FALSE`. This may be changed though, so it’s better not to rely on these.

### B.3.2 Logical Operations

Logical operators such as `&` (and) and `|` (or)
are performed in the same manner as arithmetic ones, i.e.:

- they are elementwise operations and
- recycling rule is applied if necessary.

For example,

`## [1] TRUE`

`## [1] TRUE FALSE`

`## [1] TRUE FALSE TRUE TRUE`

The `!` operator stands for logical elementwise negation:

`## [1] FALSE TRUE`

Generally, operations on `NA`s yield `NA` unless another outcome
makes sense.

```
u <- c(TRUE, FALSE, NA)
v <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)
u & v # elementwise AND (conjunction)
```

`## [1] TRUE FALSE NA FALSE FALSE FALSE NA FALSE NA`

`## [1] TRUE TRUE TRUE TRUE FALSE NA TRUE NA NA`

`## [1] FALSE TRUE NA`
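The second and third outputs above correspond to the elementwise alternative and negation. A sketch, re-using the same `u` and `v` (with `u` recycled three times):

```r
u <- c(TRUE, FALSE, NA)
v <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, NA, NA, NA)

u | v    # elementwise OR: TRUE | NA is TRUE, but FALSE | NA is NA
!u       # elementwise NOT: the negation of NA is still NA
```

Note how `NA` behaves like "unknown": the result is `NA` only when the unknown value could change the outcome.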

### B.3.3 Comparison Operations

We can compare the corresponding elements of two numeric vectors
and get a logical vector in result.
Operators such as `<` (less than), `<=` (less than or equal),
`==` (equal), `!=` (not equal), `>` (greater than) and `>=` (greater than or equal)
are again elementwise and use the recycling rule if necessary.

`## [1] FALSE FALSE FALSE TRUE TRUE`

`## [1] TRUE TRUE FALSE FALSE`

`## [1] TRUE FALSE FALSE TRUE TRUE`

### B.3.4 Aggregation Functions

Also note the following operations on *logical* vectors:

`## [1] FALSE`

`## [1] TRUE`

Moreover:

`## [1] 6`

`## [1] 0.6`

The behaviour of `sum()` and `mean()` is dictated by the fact
that, when interpreted in numeric terms, `TRUE` is interpreted as
numeric `1` and `FALSE` as `0`.

`## [1] 0 1`

Therefore in the example above we have:

`## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE`

`## [1] 0 0 0 0 1 1 1 1 1 1`

`## [1] 6`

Yes, there are 6 values equal to TRUE (or 6 ones after conversion); the sum of zeros and ones gives the number of ones.
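The whole chain of reasoning can be replayed with a sketch like the one below. The vector `x` and the threshold `4` are assumptions chosen to be consistent with the outputs in this subsection.

```r
x <- 1:10                   # hypothetical data
l <- x > 4                  # a logical vector

all(l)                      # are all the elements TRUE?
any(l)                      # is at least one element TRUE?
sum(l)                      # how many elements are TRUE
mean(l)                     # the proportion of TRUE elements

as.numeric(c(FALSE, TRUE))  # FALSE becomes 0, TRUE becomes 1
sum(as.numeric(l))          # the same as sum(l)
```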

## B.4 Character Vectors

### B.4.1 Creating Character Vectors

Individual character strings can be created using double quotes or apostrophes. These are the elements of character vectors:

`## [1] "a string"`

`## [1] "character"`

`## [1] "character"`

`## [1] 1`

`## [1] "aaa" "bb" "c" "aaa" "bb" "c"`

### B.4.2 Concatenating Character Vectors

To join (concatenate) the corresponding elements of two or more character vectors,
we call the `paste()` function:

`## [1] "a 1" "b 2" "c 3"`

`## [1] "a1" "b2" "c3"`

Also note:

`## [1] "a 1" "b 2" "c 3"`

`## [1] "a 1" "b 2" "c 3" "a 4" "b 5" "c 6"`

`## [1] "a 1 !" "b 2 ?" "c 3 !" "a 4 ?" "b 5 !" "c 6 ?"`

### B.4.3 Collapsing Character Vectors

We can also collapse a sequence of strings to a single string:

`## [1] "abcd"`

`## [1] "a,b,c,d"`
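Calls like the following give the concatenated and collapsed strings shown above. The arguments are reconstructed from the outputs and should be read as an assumption.

```r
paste(c("a", "b", "c"), c("1", "2", "3"))           # default separator: a space
paste(c("a", "b", "c"), c("1", "2", "3"), sep="")   # no separator
paste(c("a", "b", "c"), 1:3)                        # numbers are converted to strings
paste(c("a", "b", "c"), 1:6)                        # the shorter vector is recycled
paste(c("a", "b", "c"), 1:6, c("!", "?"))           # more than two vectors

paste(c("a", "b", "c", "d"), collapse="")           # collapse to a single string
paste(c("a", "b", "c", "d"), collapse=",")          # ...with a separator
```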

## B.5 Vector Subsetting

### B.5.1 Subsetting with Positive Indices

In order to extract subsets (parts) of vectors, we use the square brackets:

`## [1] 10 20 30 40 50 60 70 80 90 100`

`## [1] 10`

`## [1] 100`

More than one element at a time can also be extracted:

`## [1] 10 20 30`

`## [1] 10 100`

For example, the `order()` function returns the indices of the
smallest, 2nd smallest, 3rd smallest, …, the largest element in a given vector.
We will use this function when implementing our first classifier.

`## [1] 3 4 2 5 1`

Hence, we see that the smallest element in `y` is at index 3
and the largest at index 1:

`## [1] 10`

`## [1] 50`

Therefore, to get a sorted version of `y`, we call:

`## [1] 10 20 30 40 50`

We can also obtain the 3 largest elements by calling:

`## [1] 50 40 30`
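A sketch of subsetting with positive indices and of `order()` follows. The vectors `x` and `y` are assumptions consistent with the outputs in this subsection.

```r
x <- seq(10, 100, 10)
x[1]                    # the first element
x[length(x)]            # the last element
x[1:3]                  # the first three elements
x[c(1, length(x))]      # the first and the last element

y <- c(50, 30, 10, 20, 40)
o <- order(y)           # indices of the smallest, 2nd smallest, ... element
y[o[1]]                 # the smallest element
y[o[length(y)]]         # the largest element
y[o]                    # a sorted version of y
y[order(y, decreasing=TRUE)[1:3]]   # the 3 largest elements
```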

### B.5.2 Subsetting with Negative Indices

Subsetting with a vector of negative indices *excludes* the elements
at given positions:

`## [1] 20 30 40 50 60 70 80 90 100`

`## [1] 40 50 60 70 80 90 100`

`## [1] 40 60 70 90 100`
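For instance, assuming `x` holds the multiples of 10 from 10 to 100, the three outputs above could come from:

```r
x <- seq(10, 100, 10)
x[-1]                 # all but the first element
x[-(1:3)]             # all but the first three; note the parentheses
x[-c(1:3, 5, 8)]      # additionally drop the 5th and the 8th element
```

The parentheses in `-(1:3)` matter: `-1:3` would mean the sequence from -1 to 3, which mixes negative and positive indices and is an error.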

### B.5.3 Subsetting with Logical Vectors

We may also subset a vector \(\boldsymbol{x}\) of length \(n\) with a logical vector \(\boldsymbol{l}\) also of length \(n\). The \(i\)-th element, \(x_i\), will be extracted if and only if the corresponding \(l_i\) is true.

`## [1] 10 50 70 80 100`

This gets along nicely with comparison operators that yield logical vectors on output.

`## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE`

`## [1] 60 70 80 90 100`

`## [1] 10 20 80 90 100`

`## [1] 10 20 30 40 50 60 70 80 90`

`## [1] 20 30 40 50 60 70 80 90`

Of course, e.g., `x[x<max(x)]` returns a new, independent object.
In order to remove the greatest element in `x` permanently, we can write
`x <- x[x<max(x)]`.
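The following sketch shows logical subsetting in action. The inputs are assumptions chosen to match the outputs printed in this subsection.

```r
x <- seq(10, 100, 10)

# an explicit logical vector of the same length as x
x[c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)]

x > 50                           # a comparison yields a logical vector...
x[x > 50]                        # ...which can be used as an index
x[x < 30 | x > 70]               # conditions can be combined with | and &
x[x < max(x)]                    # all but the greatest element
x[x > min(x) & x < max(x)]       # neither the smallest nor the greatest
```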

### B.5.4 Replacing Elements

Note that the three above vector indexing schemes (positive, negative, logical indices) allow for replacing specific elements with new values.

`## [1] 10 10000 10000 10000 10000 10000 10000 10000 10000 10000`

`## [1] 10 10000 10000 10000 10000 10000 10000 1 2 3`
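Replacement uses the same index syntax on the left-hand side of an assignment. A sketch, with inputs assumed from the two outputs above:

```r
x <- seq(10, 100, 10)
x[-1] <- 10000              # replace all but the first element (value is recycled)
print(x)
x[-(1:7)] <- c(1, 2, 3)     # replace the last three elements
print(x)
```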

### B.5.5 Other Functions

`head()` and `tail()` return, respectively, a few (6 by default) first and last elements
of a vector.

`## [1] 10 10000 10000 10000 10000 10000`

`## [1] 1 2 3`

Sometimes the `which()` function can come in handy.
For a given logical vector, it returns all the indices
where `TRUE` elements are stored.

`## [1] 1 3 4 7`

`## [1] 50 30 10 20 40`

`## [1] 1 5`

Note that `y[y>70]` gives the same result
as `y[which(y>70)]` (as long as `y` contains no missing values)
but is faster (because it involves fewer operations).

`which.min()` and `which.max()` return the index of the smallest
and the largest element, respectively:

`## [1] 3`

`## [1] 1`

`## [1] 10`

`is.na()` indicates which elements are missing values (`NA`s):

`## [1] FALSE FALSE TRUE FALSE TRUE FALSE`

Therefore, to remove them from `z` permanently,
we can write (compare `na.omit()`, see also `is.finite()`):

`## [1] 1 2 4 6`
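The functions from this subsection can be tried out as sketched below. The input vectors and the threshold `30` are assumptions consistent with the printed outputs.

```r
which(c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))   # indices of TRUE elements

y <- c(50, 30, 10, 20, 40)
which(y > 30)        # positions of the elements greater than 30
which.min(y)         # index of the smallest element
which.max(y)         # index of the largest element
y[which.min(y)]      # the smallest element itself, same as min(y)

z <- c(1, 2, NA, 4, NA, 6)
is.na(z)             # TRUE where a value is missing
z <- z[!is.na(z)]    # keep only the non-missing elements
print(z)
```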

## B.6 Named Vectors

### B.6.1 Creating Named Vectors

Vectors in R can be *named* – each element can be assigned a string label.

```
## a b c d e
## 20 40 99 30 10
```

Other ways to create named vectors include:

```
## a b c
## 1 2 3
```

```
## a b c
## 1 2 3
```

For instance, the `summary()` function returns a named vector:

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 20.0 30.0 39.8 40.0 99.0
```

This gives the minimum, 1st quartile (25%-quantile), median (50%-quantile), arithmetic mean, 3rd quartile (75%-quantile) and maximum.

Note that `x` is still a numeric vector; we can perform various operations
on it as usual:

`## [1] 199`

```
## a b c d e
## 20 40 99 30 10
```

Names can be dropped by calling:

`## [1] 20 40 99 30 10`

`## [1] 20 40 99 30 10`

### B.6.2 Subsetting Named Vectors with Character String Indices

It turns out that extracting elements from a named vector can *also* be
performed by means of a vector of character string indices:

```
## a d b
## 20 30 40
```

```
## Median Mean
## 30.0 39.8
```
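The named-vector operations from this section can be sketched as follows. The way `x` is constructed here is an assumption; only its printed form is given above.

```r
x <- c(20, 40, 99, 30, 10)
names(x) <- c("a", "b", "c", "d", "e")   # attach labels to an existing vector
print(x)

c(a=1, b=2, c=3)                         # names given directly in c()
structure(1:3, names=c("a", "b", "c"))   # or via the names attribute

s <- summary(x)                          # a named vector of summary statistics
sum(x)                                   # arithmetic works as usual
unname(x)                                # drop the names
as.numeric(x)                            # this drops them as well

x[c("a", "d", "b")]                      # subsetting by names
s[c("Median", "Mean")]
```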

## B.7 Factors

Factors are *special* kinds of vectors that are frequently used to
store qualitative data, e.g., species, groups, types.
Factors are convenient in situations where we have many observations,
but the number
of distinct (unique) values is relatively small.

### B.7.1 Creating Factors

For example, the following character vector:

```
## [1] "green" "green" "green" "red" "green" "red" "red"
## [8] "red" "green" "blue"
```

can be converted to a factor by calling:

```
## [1] green green green red green red red red green blue
## Levels: blue green red
```

Note how differently factors are printed out on the console.

### B.7.2 Levels

We can easily obtain the list of unique labels:

`## [1] "blue" "green" "red"`

Those can be re-encoded by calling:

```
## [1] g g g r g r r r g b
## Levels: b g r
```

To create a contingency table (in the form of a named numeric vector, giving how many values are at each factor level), we call:

```
## fcol
## b g r
## 1 5 4
```
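A sketch of the factor-related calls discussed so far is given below. The names `col` and `fcol` are assumptions (the latter is suggested by the header of the contingency table above).

```r
col <- c("green", "green", "green", "red", "green",
         "red", "red", "red", "green", "blue")
fcol <- factor(col)                 # convert a character vector to a factor
print(fcol)

levels(fcol)                        # unique labels, sorted alphabetically
levels(fcol) <- c("b", "g", "r")    # re-encode: blue->b, green->g, red->r
print(fcol)

table(fcol)                         # how many values at each level
```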

### B.7.3 Internal Representation (*)

Factors have the look and feel of character vectors; internally, however, they are represented as integer sequences.

`## [1] "factor"`

`## [1] "numeric"`

`## [1] 2 2 2 3 2 3 3 3 2 1`

These are always integers from `1` to `M` inclusive,
where `M` is the number of levels.
Their meaning is given by the `levels()` function:
in the example above, the meaning of the codes `1`, `2`, `3` is,
respectively, `b, g, r`.

If we wished to generate a factor with a specific order of labels, we could call:

```
## [1] green green green red green red red red green blue
## Levels: red green blue
```

We can also assign different labels upon creation of a factor:

```
## [1] g g g r g r r r g b
## Levels: r g b
```

Knowing how factors are represented is important when we deal
with factors that are built around data that *look like* numeric.
This is because their conversion to numeric
gives the internal codes, not the actual values:

```
## [1] 1 3 0 1 4 0 0 1 4
## Levels: 0 1 3 4
```

`## [1] 2 3 1 2 4 1 1 2 4`

`## [1] 1 3 0 1 4 0 0 1 4`
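A minimal sketch of this pitfall, assuming a factor built from the numeric-looking values shown above. Converting to character first is one common way to recover the actual values; the variable name `f` is introduced here for illustration.

```r
f <- factor(c(1, 3, 0, 1, 4, 0, 0, 1, 4))
print(f)

as.numeric(f)                 # the internal codes, NOT the values
as.numeric(as.character(f))   # the actual values
as.numeric(levels(f))[f]      # an equivalent alternative
```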

Moreover, that idea is labour-saving in contexts such as plotting
of data that are grouped into different classes.
For instance, here is a scatter plot
for the Sepal.Length and Petal.Width variables in the `iris` dataset (which
is an object of type `data.frame`, see below).
Flowers are of different Species, and we wish to indicate which point belongs
to which class:

`## [1] 5.1 5.4 7.0 6.2 6.3`

`## [1] 0.2 0.2 1.4 1.5 2.5`

```
## [1] setosa setosa versicolor versicolor virginica
## Levels: setosa versicolor virginica
```

`## [1] 1 1 2 2 3`

```
plot(iris$Sepal.Length,            # x (it's a vector)
    iris$Petal.Width,              # y (it's a vector)
    col=as.numeric(iris$Species),  # colours
    pch=as.numeric(iris$Species)   # plotting symbols
)
```

The above (see Figure 4) was possible because the Species column is a factor object with:

`## [1] "setosa" "versicolor" "virginica"`

and the meaning of `pch` of 1, 2, 3, … is “circle”, “triangle”, “plus”, …,
respectively. What’s more, there’s a default palette that maps
consecutive integers to different colours:

```
## [1] "black" "red" "green3" "blue" "cyan" "magenta"
## [7] "yellow" "gray"
```

Hence, black circles mark irises from the 1st class, i.e., “setosa”.

## B.8 Lists

Numeric, logical and character vectors are *atomic* objects – each
component is of the same type. Let’s take a look at what happens when
we create an atomic vector out of objects of different types:

`## [1] "nine" "FALSE" "7" "TRUE"`

`## [1] 0 7 1 7`

In each case, we get an object of the most “general” type which is able to represent our data.

On the other hand, R *lists* are *generalised* vectors.
They can consist of arbitrary R objects, possibly of mixed types –
also other lists.

### B.8.1 Creating Lists

Most commonly, we create a generalised vector by calling the `list()` function.

```
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
## [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
## [[3]]
## [1] 0.9568333 0.4533342 0.6775706
```

`## [1] "list"`

`## [1] "list"`

`## [1] 3`

There’s a more compact way to print a list on the console:

```
## List of 3
## $ : int [1:5] 1 2 3 4 5
## $ : chr [1:26] "a" "b" "c" "d" ...
## $ : num [1:3] 0.957 0.453 0.678
```

We can also convert an atomic vector to a list by calling:

```
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
```

### B.8.2 Named Lists

Lists, like other vectors, may be assigned a `names` attribute.

```
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
## [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
## $c
## [1] 0.9568333 0.4533342 0.6775706
```

### B.8.3 Subsetting and Extracting From Lists

Applying a square brackets operator creates a sub-list, which is of type list as well.

```
## $b
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
## [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
## $c
## [1] 0.9568333 0.4533342 0.6775706
```

```
## $a
## [1] 1 2 3 4 5
##
## $c
## [1] 0.9568333 0.4533342 0.6775706
```

```
## $a
## [1] 1 2 3 4 5
```

Note in the 3rd case we deal with a list of length one, not a numeric vector.

To *extract* (dig into) a particular (single) element, we use double square brackets:

`## [1] 1 2 3 4 5`

`## [1] 0.9568333 0.4533342 0.6775706`

The latter can equivalently be written as:

`## [1] 0.9568333 0.4533342 0.6775706`
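The difference between single and double square brackets can be sketched as follows. The list `l` and the particular indices are assumptions modelled on the outputs above; the random component will differ from run to run.

```r
l <- list(a=1:5, b=letters, c=runif(3))   # a named list of three vectors

l[2:3]           # a sub-list with the 2nd and the 3rd element
l[c("a", "c")]   # a sub-list selected by names
l[1]             # still a list (of length one)

l[[1]]           # the first element itself: an integer vector
l[["c"]]         # extraction by name...
l$c              # ...and its shorthand
```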

### B.8.4 Common Operations

Lists, because of their generality (they can store any kind of object), have few dedicated operations. In particular, it neither makes sense to add, multiply, … two lists together nor to aggregate them.

However, if we wish to run some operation on each element, we can call list-apply:

```
## $x
## [1] 0.57263340 0.10292468 0.89982497 0.24608773 0.04205953
##
## $y
## [1] 0.3279207 0.9545036 0.8895393 0.6928034 0.6405068 0.9942698
##
## $z
## [1] 0.6557058 0.7085305 0.5440660
```

```
## $x
## [1] 0.3727061
##
## $y
## [1] 0.7499239
##
## $z
## [1] 0.6361008
```

The above computes the mean of each of the three numeric vectors
stored inside list `k`.
Moreover:

```
## $x
## [1] 0.04205953 0.89982497
##
## $y
## [1] 0.3279207 0.9942698
##
## $z
## [1] 0.5440660 0.7085305
```

The built-in function `range(x)` returns `c(min(x), max(x))`.

`unlist()` tries (it might not always be possible)
to unwind a list to a simpler, atomic form:

```
## x y z
## 0.3727061 0.7499239 0.6361008
```
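The list-apply calls described above can be sketched as follows. The construction of `k` is an assumption (three random vectors of lengths 5, 6 and 3, as in the printed list), so the numbers will differ from those shown.

```r
k <- list(x=runif(5), y=runif(6), z=runif(3))   # hypothetical input list

m <- lapply(k, mean)    # apply mean() to each element; the result is a list
r <- lapply(k, range)   # each element becomes c(min, max)
print(m)
print(r)

unlist(m)               # unwind the list of means to a named numeric vector
```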

Moreover, `split(x, f)` classifies elements in a vector `x` into subgroups defined by a factor (or an object coercible to one)
of the same length.

```
x <- c( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
f <- c("a", "b", "a", "a", "c", "b", "b", "a", "a", "b")
split(x, f)
```

```
## $a
## [1] 1 3 4 8 9
##
## $b
## [1] 2 6 7 10
##
## $c
## [1] 5
```

This is very useful when combined with `lapply()` and `unlist()`.
For instance, here are the mean sepal lengths
for each of the three flower species in the famous `iris` dataset.

```
## setosa versicolor virginica
## 5.006 5.936 6.588
```
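One way to obtain the group means above is to chain the three functions, as in this sketch (the intermediate variable names are introduced for illustration):

```r
# split sepal lengths into three vectors, one per species
groups <- split(iris$Sepal.Length, iris$Species)

# compute the mean of each group and simplify the result to a named vector
means <- unlist(lapply(groups, mean))
print(means)
```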

By the way, if we take a look at the documentation of `lapply()` (see `?lapply`),
we will note that this function is defined as `lapply(X, FUN, ...)`.
Here `...` denotes the optional arguments that will be passed to `FUN`.

In other words, `lapply(X, FUN, ...)` returns a list `Y` of length `length(X)` such that `Y[[i]] <- FUN(X[[i]], ...)` for each `i`.
For example, `mean()` has an additional argument `na.rm` that
aims to remove missing values from the input vector.
Compare the following:
Compare the following:

`## [1] 5.5 NA`

`## [1] 5.5 3.0`
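A sketch of passing an extra argument through `...` follows; the input list `X` is an assumption consistent with the two outputs above.

```r
X <- list(1:10, c(1, 2, NA, 4, 5))    # the second vector has a missing value

unlist(lapply(X, mean))               # mean(X[[i]]) for each i
unlist(lapply(X, mean, na.rm=TRUE))   # mean(X[[i]], na.rm=TRUE) for each i
```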

Of course, we can always pass a custom (self-made) function object as well:

```
min_mean_max <- function(x) {
# the last expression evaluated in the function's body
# gives its return value:
c(min(x), mean(x), max(x))
}
lapply(k, min_mean_max)
```

```
## $x
## [1] 0.04205953 0.37270606 0.89982497
##
## $y
## [1] 0.3279207 0.7499239 0.9942698
##
## $z
## [1] 0.5440660 0.6361008 0.7085305
```

or, more concisely (we can skip the curly braces here – they are normally used to group many expressions into one; also, if we don’t plan to re-use the function again, there’s no need to give it a name):

```
## $x
## [1] 0.04205953 0.37270606 0.89982497
##
## $y
## [1] 0.3279207 0.7499239 0.9942698
##
## $z
## [1] 0.5440660 0.6361008 0.7085305
```
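The more concise variant passes an anonymous function directly to `lapply()`. In the sketch below, the list `k` is filled with small made-up vectors (an assumption) so that the results are easy to verify by hand.

```r
k <- list(x=c(1, 2, 6), y=c(4, 8), z=c(5))   # hypothetical input list

# anonymous function without curly braces: its body is a single expression
res <- lapply(k, function(x) c(min(x), mean(x), max(x)))
print(res)
```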