9.3 MovieLens Dataset (*)

9.3.1 Dataset


Let us make a few recommendations based on the MovieLens-9/2018-Small dataset available at https://grouplens.org/datasets/movielens/latest/

The dataset consists of ca. 100,000 ratings of ca. 9,000 movies by ca. 600 users. It was last updated in 9/2018.

This is already a pretty large dataset! We might run into problems with memory usage and run-time.

The following examples are a bit more difficult to follow (programming-wise); therefore, we mark them with (*).

See also https://movielens.org/ and (Harper & Konstan 2015).


##   movieId                    title
## 1       1         Toy Story (1995)
## 2       2           Jumanji (1995)
## 3       3  Grumpier Old Men (1995)
## 4       4 Waiting to Exhale (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## [1] 9742

##   userId movieId rating timestamp
## 1      1       1      4 964982703
## 2      1       3      4 964981247
## 3      1       6      4 964982224
## 4      1      47      5 964983815
## [1] 100836
## 
##   0.5     1   1.5     2   2.5     3   3.5     4   4.5     5 
##  1370  2811  1791  7551  5550 20047 13136 26818  8551 13211
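The listings above look like the output of `head()`, `nrow()` and `table()` applied to the two main tables of the dataset. A minimal sketch of how they could be obtained, assuming the archive has been unpacked into a hypothetical directory `ml-latest-small` that contains `movies.csv` and `ratings.csv` (`load_movielens` is a helper name introduced here for illustration only):

```r
# Hypothetical location of the unpacked archive; adjust as needed.
data_dir <- "ml-latest-small"

# Read the two tables; comment.char="" keeps any '#' in titles intact.
load_movielens <- function(data_dir) {
    movies <- read.csv(file.path(data_dir, "movies.csv"),
                       comment.char="", stringsAsFactors=FALSE)
    ratings <- read.csv(file.path(data_dir, "ratings.csv"),
                        comment.char="", stringsAsFactors=FALSE)
    list(movies=movies, ratings=ratings)
}

# Only run the inspection if the (assumed) directory is really there.
if (dir.exists(data_dir)) {
    ml <- load_movielens(data_dir)
    print(head(ml$movies, 4))        # first few movies
    print(nrow(ml$movies))           # number of movies
    print(head(ml$ratings, 4))       # first few ratings
    print(nrow(ml$ratings))          # number of ratings
    print(table(ml$ratings$rating))  # how often each rating value occurs
}
```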

9.3.2 Data Cleansing


The movieIds need to be re-encoded: not every film is rated in the database, so the identifiers that actually occur in the ratings are not consecutive. We will re-map them to consecutive integers.

## [1] 610
## [1] 9724
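One possible way to do the re-mapping is with `match()`. The sketch below uses a tiny made-up ratings table instead of the real one; the column name `movieId2` and the variable `rated_ids` are assumptions introduced for illustration:

```r
# Toy stand-in for the ratings table (the values are made up).
ratings <- data.frame(
    userId  = c(1, 1, 2, 3),
    movieId = c(1, 47, 47, 193609),
    rating  = c(4, 5, 3, 2.5)
)

# Sorted movieIds that actually occur in the ratings.
rated_ids <- sort(unique(ratings$movieId))

# match() gives the position of each movieId in rated_ids,
# i.e., a consecutive code between 1 and length(rated_ids).
ratings$movieId2 <- match(ratings$movieId, rated_ids)

length(unique(ratings$userId))  # number of distinct users
length(rated_ids)               # number of distinct rated movies
```

Keeping `rated_ids` around makes it possible to translate back: `rated_ids[ratings$movieId2]` recovers the original identifiers.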

We will use a sparse matrix data type (from R package Matrix) to store ratings data. We don’t want to run out of memory!

Sparse == many zeros.


## 6 x 18 sparse Matrix of class "dgCMatrix"
##                                         
## [1,] 4 4 4 5 5 3 5 4 5 5 5 5 3 5 4 5 3 3
## [2,] . . . . . . . . . . . . . . . . . .
## [3,] . . . . . . . . . . . . . . . . . .
## [4,] . . . 2 . . . . . . . . . . 2 5 1 .
## [5,] 4 . . . 4 . . 4 . . . . . . . . 5 2
## [6,] . 5 4 4 1 . . 5 4 . 3 4 . 3 . . 2 5
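A user-by-movie matrix like the one printed above can be built with `sparseMatrix()` from the Matrix package. This is only a sketch on made-up data; the name `R` and the column `movieId2` (the re-mapped identifier) are assumptions:

```r
library("Matrix")

# Toy ratings with already re-mapped (consecutive) movie identifiers.
ratings <- data.frame(
    userId   = c(1, 1, 2, 3, 3),
    movieId2 = c(1, 2, 2, 1, 3),
    rating   = c(4, 5, 3, 2.5, 1)
)

# Rows = users, columns = movies; only the given entries are stored.
# Note: if a (user, movie) pair occurred twice, the values would be summed.
R <- sparseMatrix(
    i = ratings$userId,
    j = ratings$movieId2,
    x = ratings$rating,
    dims = c(max(ratings$userId), max(ratings$movieId2))
)

print(R)                   # missing ratings are displayed as dots
nnzero(R) / prod(dim(R))   # fraction of entries that are actually stored
```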

9.3.3 Item-Item Similarities


Recall that the cosine similarity between \(\mathbf{a},\mathbf{b}\in\mathbb{R}^m\) is given by:

\[ S_C(\mathbf{a},\mathbf{b}) = \frac{ \sum_{i=1}^m a_i b_i }{ \sqrt{ \sum_{i=1}^m a_i^2 } \sqrt{ \sum_{i=1}^m b_i^2 } } \]

In vector/matrix algebra notation (have you noticed this section is marked with (*)?), this is:

\[ S_C(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}^T \mathbf{b}}{ \sqrt{{\mathbf{a}^T \mathbf{a}}} \sqrt{{\mathbf{b}^T \mathbf{b}}} } \]
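Both versions of the formula translate almost literally into R. A small sketch on made-up vectors (`cosine_sim` is a helper name introduced here):

```r
# Cosine similarity of two numeric vectors of equal length
# (undefined, i.e., NaN, if either vector is all zeros).
cosine_sim <- function(a, b) {
    stopifnot(length(a) == length(b))
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(4, 0, 5, 3)
b <- c(5, 0, 4, 0)

# Sum-based version.
cosine_sim(a, b)

# The same in vector/matrix notation; crossprod(a, b) computes t(a) %*% b.
drop(crossprod(a, b) / sqrt(crossprod(a) * crossprod(b)))
```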

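If \(\mathbf{A}\in\mathbb{R}^{m\times n}\), we can “almost” compute all the pairwise cosine similarities between its \(n\) columns at once by applying:

\[ S_C(\mathbf{A},\mathbf{A}) = \frac{\mathbf{A}^T \mathbf{A}}{ \dots } \]

The numerator \(\mathbf{A}^T \mathbf{A}\) gathers the dot products between all pairs of columns of \(\mathbf{A}\); each entry still has to be divided by the product of the two corresponding column norms.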


Cosine item-item similarities:

crossprod(A,B) gives \(\mathbf{A}^T \mathbf{B}\)

tcrossprod(A,B) gives \(\mathbf{A} \mathbf{B}^T\)
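One way to put these two functions together, given as a sketch rather than as the original implementation: `crossprod(A)` yields all the dot products between columns, and `tcrossprod(norms)` yields the matching products of column norms. The function name `cosine_sim_columns` and the toy matrix are assumptions made for illustration:

```r
library("Matrix")

# Cosine similarities between all pairs of columns of A.
# Columns that are entirely zero would give NaN (0/0).
cosine_sim_columns <- function(A) {
    norms <- sqrt(colSums(A^2))            # Euclidean norm of each column
    # crossprod(A) is t(A) %*% A; tcrossprod(norms) is norms %*% t(norms).
    # The result is a dense n x n matrix, which can be large for many items.
    as.matrix(crossprod(A)) / tcrossprod(norms)
}

# Toy 3 users x 3 items matrix, filled column by column.
A <- Matrix(c(4, 0, 5,
              5, 0, 4,
              0, 3, 0), nrow = 3, sparse = TRUE)

S <- cosine_sim_columns(A)
print(round(S, 3))
```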

9.3.4 Example Recommendations



## # A tibble: 10 x 2
##    Title                                              SIC
##    <chr>                                            <dbl>
##  1 Monty Python's The Meaning of Life (1983)        1    
##  2 Monty Python's Life of Brian (1979)              0.611
##  3 Monty Python and the Holy Grail (1975)           0.514
##  4 House of Flying Daggers (Shi mian mai fu) (2004) 0.493
##  5 Hitchhiker's Guide to the Galaxy, The (2005)     0.455
##  6 Bowling for Columbine (2002)                     0.451
##  7 Shaun of the Dead (2004)                         0.446
##  8 O Brother, Where Art Thou? (2000)                0.445
##  9 Ghost World (2001)                               0.444
## 10 Full Metal Jacket (1987)                         0.443

## # A tibble: 10 x 2
##    Title                                               SIC
##    <chr>                                             <dbl>
##  1 Toy Story (1995)                                  1    
##  2 Toy Story 2 (1999)                                0.573
##  3 Jurassic Park (1993)                              0.566
##  4 Independence Day (a.k.a. ID4) (1996)              0.564
##  5 Star Wars: Episode IV - A New Hope (1977)         0.557
##  6 Forrest Gump (1994)                               0.547
##  7 Lion King, The (1994)                             0.541
##  8 Star Wars: Episode VI - Return of the Jedi (1983) 0.541
##  9 Mission: Impossible (1996)                        0.539
## 10 Groundhog Day (1993)                              0.534

…and so on.
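Tables like the ones above can be produced by sorting one column of the similarity matrix. The following sketch uses a made-up 4 x 4 similarity matrix and placeholder titles; `most_similar` is a hypothetical helper, and a plain data frame is used here instead of a tibble:

```r
# Return the k items most similar to `query`, most similar first.
most_similar <- function(S, titles, query, k = 10) {
    j <- match(query, titles)               # column of the queried item
    stopifnot(!is.na(j))
    idx <- head(order(S[, j], decreasing = TRUE), k)
    data.frame(Title = titles[idx], SIC = S[idx, j],
               stringsAsFactors = FALSE)
}

# Made-up similarities, for illustration only.
titles <- c("Movie A", "Movie B", "Movie C", "Movie D")
S <- matrix(c(1.0, 0.6, 0.2, 0.4,
              0.6, 1.0, 0.3, 0.1,
              0.2, 0.3, 1.0, 0.5,
              0.4, 0.1, 0.5, 1.0), nrow = 4, byrow = TRUE)

most_similar(S, titles, "Movie A", k = 3)
```

The queried item always comes first, as its similarity to itself is 1.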

9.3.5 Clustering


A cosine similarity matrix can be turned into a dissimilarity matrix:
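A common choice, assumed here rather than taken from the original code, is to subtract the similarities from 1. As ratings are non-negative, the cosine similarities lie in [0, 1], and so do the resulting dissimilarities. The matrix below is made up for illustration:

```r
# Made-up cosine similarity matrix for 4 items.
S <- matrix(c(1.0, 0.6, 0.2, 0.4,
              0.6, 1.0, 0.3, 0.1,
              0.2, 0.3, 1.0, 0.5,
              0.4, 0.1, 0.5, 1.0), nrow = 4, byrow = TRUE)

# 1 - similarity: identical items get 0, unrelated ones get 1.
# as.dist() keeps the lower triangle as an object of class "dist".
D <- as.dist(1 - S)
print(D)
```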

This enables us to perform, e.g., a cluster analysis of the items:
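For instance, hierarchical clustering accepts a dissimilarity object directly. The sketch below is only one possibility: the complete linkage, the number of clusters and the toy data are all assumptions, not settings confirmed by the text:

```r
# Made-up similarities with two obvious groups: {A, B} and {C, D}.
titles <- c("Movie A", "Movie B", "Movie C", "Movie D")
S <- matrix(c(1.0, 0.9, 0.1, 0.1,
              0.9, 1.0, 0.1, 0.1,
              0.1, 0.1, 1.0, 0.8,
              0.1, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)

D <- as.dist(1 - S)                   # dissimilarities
h <- hclust(D, method = "complete")   # agglomerative clustering
cl <- cutree(h, k = 2)                # cut the tree into 2 clusters

# List the titles that fall into each cluster.
split(titles, cl)
```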


Example movies in the 3rd cluster:

## 5
## Bottle Rocket (1996), Clerks (1994), Star Wars: Episode
## IV - A New Hope (1977), Swingers (1996), Monty Python's
## Life of Brian (1979), E.T. the Extra-Terrestrial (1982),
## Monty Python and the Holy Grail (1975), Star Wars:
## Episode V - The Empire Strikes Back (1980), Princess
## Bride, The (1987), Raiders of the Lost Ark (Indiana
## Jones and the Raiders of the Lost Ark) (1981), Star Wars:
## Episode VI - Return of the Jedi (1983), Blues Brothers,
## The (1980), Duck Soup (1933), Groundhog Day (1993), Back
## to the Future (1985), Young Frankenstein (1974), Indiana
## Jones and the Last Crusade (1989), Grosse Pointe Blank
## (1997), Austin Powers: International Man of Mystery
## (1997), Men in Black (a.k.a. MIB) (1997)

Example movies in the 5th cluster:

## 5
## Blown Away (1994), Flight of the Navigator (1986), Dick
## Tracy (1990), Mighty Aphrodite (1995), Postman, The
## (Postino, Il) (1994), Flirting With Disaster (1996),
## Living in Oblivion (1995), Safe (1995), Eat Drink Man
## Woman (Yin shi nan nu) (1994), Bullets Over Broadway
## (1994), Barcelona (1994), In the Name of the Father
## (1993), Six Degrees of Separation (1993), Maya Lin: A
## Strong Clear Vision (1994), Everyone Says I Love You
## (1996), Rebel Without a Cause (1955), Wings of Desire
## (Himmel über Berlin, Der) (1987), High Noon (1952),
## Afterglow (1997), Bulworth (1998)