Collaborative Filtering in Apache Spark


One of the most important applications of machine learning in a business context is the development of a recommender system. The best-known recommender engine is Amazon's. There are two approaches to implementing a recommender engine:

  • Collaborative filtering: Bases predictions on users' past behavior (rated, ranked, or purchased items) in order to recommend additional items to users who made similar decisions. We don't need to know the features (content) of the users or the items
  • Content-based filtering: Bases predictions on the features of the items in order to recommend additional items with similar features

For our example, we're going to use the collaborative filtering approach: our dataset is in the format (user_id, item_id, rating, timestamp), and we want to recommend a list of movies to a user without knowing anything about the characteristics of the users or the movies.

The driver for collaborative filtering is the assumption that people often get the best recommendations from someone with similar tastes and opinions to their own. Apache Spark implements the Alternating Least Squares (ALS) algorithm as part of its MLlib library.

As the sample dataset, we're going to use the MovieLens 100K dataset. The dataset summary is:

  • the u.data file contains the full dataset, recording each user's rating of a movie. Each row contains (user_id, item_id, rating, timestamp) separated by tabs
  • the u.user file contains the details of the users
  • the u.item file contains the details of the movies
  • 100,000 ratings (between 1 and 5)
  • 943 users
  • 1682 movies (items)

The first step is to prepare the environment: read the input file path from an environment variable and start a SparkContext, as shown in Listing 01.


Listing 01
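A minimal sketch of what Listing 01 might look like. The environment-variable name (MOVIELENS_PATH), the fallback path, and the app name are assumptions for illustration, not from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Read the input file path from an environment variable
// (MOVIELENS_PATH is a hypothetical name; adjust to your setup)
val inputPath = sys.env.getOrElse("MOVIELENS_PATH", "data/u.data")

val conf = new SparkConf()
  .setAppName("CollaborativeFiltering")
  .setMaster("local[*]") // local mode for experimentation
val sc = new SparkContext(conf)
```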

The next step is to load the external data into an internal structure: an RDD of the case class Rating, built by splitting each line on the tab separator and mapping the fields to Rating. Finally, we cache the RDD in memory, because the ALS algorithm is iterative and needs to access the data several times; without caching, the RDD would be recomputed on every iteration. The code is shown in Listing 02.


Listing 02
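A sketch of Listing 02, assuming `sc` and `inputPath` were set up in the previous step. The parsing follows the (user_id, item_id, rating, timestamp) layout described above.

```scala
import org.apache.spark.mllib.recommendation.Rating

// Each line of u.data is "user_id \t item_id \t rating \t timestamp"
val ratings = sc.textFile(inputPath).map { line =>
  val fields = line.split("\t")
  // The timestamp (fields(3)) is not needed by ALS, so we drop it
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}.cache() // ALS is iterative; caching avoids recomputing the RDD
```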

The next step is to build the recommendation model.

It's worth noting that there are two types of user preference:

  1. Explicit preference. Treats each entry in the user-item matrix as an explicit preference given by the user to the item, for example a rating given to an item by a user
  2. Implicit preference. Treats each entry as implicit feedback, for example views, clicks, or purchase history

In this case, our ratings are explicit preferences, so we use the ALS.train method, as shown in Listing 03.


Listing 03
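A sketch of Listing 03. The hyperparameter values (rank, number of iterations, regularization) are illustrative choices, not values from the original article; `ratings` is the cached RDD from the previous step.

```scala
import org.apache.spark.mllib.recommendation.ALS

// rank: number of latent factors in the factorization
// numIterations: how many times ALS alternates between user and item factors
// lambda: regularization parameter (illustrative value)
val rank = 10
val numIterations = 10
val lambda = 0.01

val model = ALS.train(ratings, rank, numIterations, lambda)
```

For implicit-feedback data, MLlib provides ALS.trainImplicit with a similar signature.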

The next step is to evaluate the performance of the recommendation model, as shown in Listing 04.


Listing 04
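A sketch of Listing 04, evaluating the model by mean squared error on the training ratings: we predict a rating for every (user, item) pair we already have, join predictions with the actual ratings, and average the squared differences.

```scala
import org.apache.spark.mllib.recommendation.Rating

// Predict on the (user, product) pairs we already have ratings for
val usersProducts = ratings.map { case Rating(user, product, _) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
  .map { case Rating(user, product, rate) => ((user, product), rate) }

// Join actual ratings with predicted ratings on (user, product)
val ratesAndPreds = ratings
  .map { case Rating(user, product, rate) => ((user, product), rate) }
  .join(predictions)

// Mean squared error between actual and predicted ratings
val mse = ratesAndPreds
  .map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }
  .mean()
println(s"Mean Squared Error = $mse")
```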

Finally, if the performance is good enough, we can start making recommendations for our users based on the trained model. Suppose we want to recommend five movies to the user with id 196; we write the code shown in Listing 05. Of course, to see the real names of the movies, we need to look up the returned movie ids in the u.item file.


Listing 05
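A sketch of Listing 05, using the model's recommendProducts method, which returns Rating objects ordered by predicted rating (highest first).

```scala
import org.apache.spark.mllib.recommendation.Rating

val userId = 196
// Top five recommendations for this user
val recommendations = model.recommendProducts(userId, 5)
recommendations.foreach { case Rating(_, movieId, score) =>
  // movieId can be resolved to a title via the u.item file
  println(s"movie $movieId, predicted rating $score")
}
```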

Now you can apply these principles and examples to your own solutions.

