Collaborative Filtering From Scratch

Collaborative filtering system from scratch using fastai and PyTorch

Gaurav Adlakha
·
2025-05-15

Have you ever wondered how Netflix recommends movies you might like, or how Amazon suggests products you might want to buy? Behind these recommendations is often a technique called collaborative filtering. In this post, we'll build a collaborative filtering system from scratch using PyTorch and fastai.

What is Collaborative Filtering?

Collaborative filtering is a technique that makes predictions about what a user might like based on the preferences of many other users. The basic idea is simple: if person A likes items 1, 2, and 3, and person B likes items 1, 2, and 4, then person A might also like item 4.
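To make that intuition concrete, here is a toy sketch of the "person A and person B" example. The `suggest` helper and its overlap rule are hypothetical and only illustrate the idea; the model we build later learns these relationships from ratings instead of using a hand-written rule.

```python
# Toy illustration only: recommend items liked by users with overlapping tastes.
likes = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
}

def suggest(person, likes, min_overlap=2):
    """Items liked by similar users that `person` has not liked yet."""
    mine = likes[person]
    suggestions = set()
    for other, theirs in likes.items():
        # Treat two people as "similar" if they share enough liked items.
        if other != person and len(mine & theirs) >= min_overlap:
            suggestions |= theirs - mine
    return suggestions

print(suggest("A", likes))  # {4}
print(suggest("B", likes))  # {3}
```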

Let's dive right in and start building our own collaborative filtering system using the MovieLens dataset.

Setting Up

First, let's import the necessary libraries:

from fastai.imports import *
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

The fastai library provides high-level components that make it easier to build deep learning models. We're using modules from collab and tabular for our collaborative filtering task, and setting a random seed for reproducibility.

Getting the Data

We'll use the MovieLens 100k dataset, which contains 100,000 movie ratings from 943 users on 1,682 movies:

path = untar_data(URLs.ML_100k)
path.ls()
(#23) [Path('/Users/gaurav.adlakha/.fastai/data/ml-100k/u.item'),Path('/Users/gaurav.adlakha/.fastai/data/ml-100k/u3.test'),Path('/Users/gaurav.adlakha/.fastai/data/ml-100k/u1.base'),Path('/Users/gaurav.adlakha/.fastai/data/ml-

The untar_data function downloads and extracts the dataset for us. We can see that there are many files in this dataset, including training and test splits.

Let's load the main ratings data:

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                  names=['user','movie','rating','timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Perfect! We now have a dataframe with four columns:

  • user: The ID of the user who gave the rating
  • movie: The ID of the movie being rated
  • rating: The rating given (1-5 stars)
  • timestamp: When the rating was given

This is the foundation we need to start building our collaborative filtering model. In the next part, we'll explore the data further and start building our model.

Understanding the Data

From our initial look, we can see that users are identified by numbers (like 196, 186, 22) and movies are also identified by numbers (242, 302, 377). The ratings range from 1 to 5, with 1 being the lowest and 5 being the highest.

The core idea behind collaborative filtering is to find patterns in these ratings. For example, if user 196 and user 186 both gave high ratings to similar movies, they might have similar tastes. This means if user 196 likes a movie that user 186 hasn't seen yet, we might recommend that movie to user 186.

Visualizing the Ratings Matrix

To get a better understanding of our data, let's create a cross-tabulation that shows how our most active users rated the most popular movies:

def get_top_ratings(df, n_users=20, n_movies=20):
    "Return crosstab of ratings from top users for top movies"
    top_users = df.user.value_counts().index[:n_users]
    top_movies = df.movie.value_counts().index[:n_movies]
    filtered = df[(df.user.isin(top_users)) & (df.movie.isin(top_movies))]
    return pd.crosstab(filtered.user, filtered.movie, filtered.rating, aggfunc='mean')

ratings_matrix = get_top_ratings(ratings)

This function selects the top 20 users who have rated the most movies and the top 20 movies that have received the most ratings. It then creates a matrix where rows represent users, columns represent movies, and each cell contains the rating that a user gave to a movie.
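If pd.crosstab is unfamiliar, a tiny made-up example may help. The toy dataframe below is hypothetical; it only shows how long-format rows turn into a user-by-movie matrix, with NaN wherever a user has not rated a movie.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 3 movies, 5 ratings.
toy = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3],
    "movie":  [10, 20, 10, 30, 20],
    "rating": [5, 3, 4, 2, 1],
})

# Rows are users, columns are movies, cells hold the (mean) rating.
toy_matrix = pd.crosstab(toy.user, toy.movie, toy.rating, aggfunc="mean")
print(toy_matrix)
```

User 3 never rated movie 10, so that cell is NaN, just like the missing entries in the real matrix.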

Movie Ratings Table

User/Movie 1 7 50 56 98 100 117 121 127 172 174 181 204 222
7 N/A 5.0 5.0 5.0 4.0 5.0 N/A 5.0 5.0 4.0 5.0 3.0 5.0 N/A
13 3.0 2.0 5.0 5.0 4.0 5.0 3.0 5.0 5.0 5.0 4.0 5.0 5.0 3.0
92 4.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 N/A 4.0 5.0 4.0 4.0 4.0
94 4.0 4.0 5.0 5.0 4.0 5.0 N/A 2.0 5.0 4.0 4.0 4.0 4.0 3.0

This matrix gives us a visual representation of our data. Each row is a user, each column is a movie, and the values are the ratings (1-5). The N/A entries (NaN in the underlying dataframe) indicate that a user hasn't rated that particular movie. Only an excerpt of the full 20x20 matrix is shown here.

Looking at this matrix, we can observe several patterns:

  1. Some users tend to give higher ratings overall (like user 276 and user 416 in the full matrix), while others are more critical.
  2. Some movies are generally rated higher than others.
  3. There are many missing values (NaN), which is typical in recommendation systems - users only rate a small fraction of all available items.

This sparsity is one of the key challenges in building recommendation systems. We need to predict the missing values based on the patterns in the existing ratings.
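To put a number on that sparsity, here is a quick back-of-the-envelope calculation using the dataset sizes quoted earlier (100,000 ratings, 943 users, 1,682 movies). The variable names are just for this illustration.

```python
n_ratings, n_users_raw, n_movies_raw = 100_000, 943, 1_682

possible = n_users_raw * n_movies_raw   # every user rating every movie
density = n_ratings / possible          # fraction of cells that are filled

print(f"{possible:,} possible ratings")
print(f"{density:.1%} filled, {1 - density:.1%} missing")
```

Roughly 6% of the full user-by-movie matrix is observed; the model has to fill in the rest.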

The Dot Product: A Measure of Similarity

At the heart of collaborative filtering is the idea of representing users and items as vectors in a shared space. The similarity between a user and an item can then be calculated using the dot product of their vectors.

Let's see a simple example:

movie = np.array([0.98, 0.9, -0.9])
user = np.array([0.9, 0.8, -0.6])
(user*movie).sum()
np.float64(2.1420000000000003)

In this example, we have:

  • A movie vector [0.98, 0.9, -0.9]
  • A user vector [0.9, 0.8, -0.6]

The dot product is calculated by multiplying corresponding elements and then summing them: (0.98 × 0.9) + (0.9 × 0.8) + (-0.9 × -0.6) = 0.882 + 0.72 + 0.54 = 2.142

This value (2.142) can be interpreted as a measure of how well this movie matches this user's preferences. The higher the dot product, the better the match.

In a real recommendation system, these vectors would have more dimensions (often called "latent factors" or "embeddings"), and they would be learned from the data rather than manually specified. Each dimension might represent some aspect of movies or user preferences, such as genre, pace, tone, etc., though these dimensions are usually not interpretable.

When Users Don't Like Movies

We just saw that a high dot product between user and movie vectors can indicate a good match. But what happens when a user doesn't like a movie? Let's see:

movie = np.array([0.98,0.9,-0.9])
user = np.array([0.1,-1.0,-0.6])
(user*movie).sum()
np.float64(-0.262)

In this case, we get a negative value (-0.262). This makes intuitive sense: when a user's preferences are opposite to a movie's characteristics, the dot product becomes negative, indicating a poor match.

Let's break down what's happening:

  • (0.98 × 0.1) = 0.098 (small positive contribution)
  • (0.9 × -1.0) = -0.9 (large negative contribution)
  • (-0.9 × -0.6) = 0.54 (positive contribution from matching negative features)

The overall result is negative, suggesting this user would probably not enjoy this movie.

This simple example shows how embeddings can capture the essence of user preferences and movie characteristics in a way that allows us to predict how well they match.
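The two examples above can also be computed in one step. This small sketch reuses the vectors from the text, stacks both users into a matrix, and scores them against the movie with a single matrix-vector product:

```python
import numpy as np

movie = np.array([0.98, 0.9, -0.9])
users = np.array([
    [0.9,  0.8, -0.6],   # the user who matched the movie
    [0.1, -1.0, -0.6],   # the user who did not
])

# One matrix-vector product gives one score per user (row).
scores = users @ movie
print(scores)  # approximately 2.142 and -0.262
```

This is the same computation a trained model performs, just with many more users and movies at once.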

Adding Movie Titles

So far, we've been working with movie IDs, which aren't very informative. Let's load the movie titles so we can better understand our data:

movies = pd.read_csv(path/'u.item', delimiter='|', header=None, encoding='latin1', 
                    usecols=[0,1], names=['movie','title'])
movies.head()
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

Now we can see that movie ID 1 corresponds to "Toy Story (1995)", movie ID 2 is "GoldenEye (1995)", and so on. This will be helpful when we start making predictions and recommendations.

Let's also look at our full ratings dataset again to remind ourselves what we're working with:

ratings
       user  movie  rating  timestamp
0       196    242       3  881250949
1       186    302       3  891717742
2        22    377       1  878887116
3       244     51       2  880606923
4       166    346       1  886397596
...     ...    ...     ...        ...
99995   880    476       3  880175444
99996   716    204       5  879795543
99997   276   1090       1  874795795
99998    13    225       2  882399156
99999    12    203       3  879959583

[100000 rows x 4 columns]

We have 100,000 ratings from 943 users on 1,682 movies. This is a substantial dataset that should allow us to build a reasonably good recommendation system.

The Core Idea: Learning Embeddings

Now that we understand the data and have seen how dot products can measure similarity, let's discuss the core idea behind our collaborative filtering model.

The goal is to learn an embedding vector for each user and each movie. These embeddings live in a shared space: when a user likes a movie, their vectors should point in similar directions (resulting in a high dot product), and when a user dislikes a movie, their vectors should point in different or opposite directions (resulting in a low or negative dot product).

For example, if a user enjoys action movies with fast-paced plots, their embedding might have high values in dimensions representing "action" and "fast-paced". Movies with similar characteristics would also have high values in these dimensions, leading to a high dot product and thus a high predicted rating.

Combining Ratings with Movie Titles

Now let's merge our ratings data with the movie titles to make our dataset more interpretable:

ratings = ratings.merge(movies)
ratings

Perfect! Now our ratings dataframe includes the movie titles alongside the IDs. This makes it much easier to understand what movies users are rating. For example, we can see that user 196 gave "Kolya (1996)" a rating of 3, and user 716 gave "Back to the Future (1985)" a perfect 5.

Setting Up for Model Training

Now, let's create a fastai CollabDataLoaders object, which will handle the data preparation for our collaborative filtering model:

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

The CollabDataLoaders class takes care of several important preprocessing steps:

  1. Splitting the data into training and validation sets
  2. Converting user IDs and movie titles into categorical variables
  3. Creating mini-batches for efficient training
  4. Working directly from the (user, title, rating) rows, so the sparse ratings matrix never has to be built in full
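Step 2 is worth a closer look. The following is a minimal sketch of the idea, not fastai's actual implementation; `build_vocab` and `encode` are hypothetical helpers. Each distinct value gets an integer index, and index 0 is reserved for a '#na#' placeholder:

```python
def build_vocab(values):
    """Map each distinct value to an integer index; index 0 is a placeholder."""
    vocab = ["#na#"] + sorted(set(values))
    return vocab, {v: i for i, v in enumerate(vocab)}

def encode(values, o2i):
    """Replace raw values by their indices (unknown values map to 0)."""
    return [o2i.get(v, 0) for v in values]

raw_users = [196, 186, 22, 244, 166, 22]
vocab, o2i = build_vocab(raw_users)

print(vocab)                   # ['#na#', 22, 166, 186, 196, 244]
print(encode(raw_users, o2i))  # [4, 3, 1, 5, 2, 1]
print(len(vocab))              # 6: five real users plus the placeholder
```

The extra placeholder is why a class count can be one larger than the number of distinct values in the data.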

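Once we have our DataLoaders set up, we can examine some basic information about our dataset:

n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_movies

This tells us how many user and title classes fastai has created: 944 user classes and 1,665 title classes. Each count includes one extra '#na#' placeholder class that fastai adds at index 0, so 944 corresponds to the 943 real users. The title count is below the 1,682 movie IDs in the dataset, most likely because a few movie IDs share the same title.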

The fastai CollabDataLoaders provides a convenient abstraction over the data preparation process. It automatically handles the conversion of our raw dataframe into a format suitable for training collaborative filtering models, saving us from having to write a lot of boilerplate code.

Understanding the Dataset Dimensions

Let's check how many user classes we have:

n_users
944

We have 944 user classes: the 943 users in the dataset plus the '#na#' placeholder that fastai adds. This number will be important as we build our embedding matrices.

Creating Embedding Factors

Now, let's manually create embedding matrices for users and movies. These are randomly initialized tensors that will later be trained to capture patterns in our data:

user_factor = torch.randn(n_users, 5)
movie_factor = torch.randn(n_movies, 5)

We've created two tensors:

  1. user_factor: A tensor of shape (944, 5) containing random values
  2. movie_factor: A tensor of shape (1665, 5) containing random values

The number 5 here represents the embedding dimension - we're representing each user and movie as a 5-dimensional vector. This is a hyperparameter that we can adjust; larger values allow for more complex representations but require more data to train effectively.
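To see why larger embeddings need more data, it helps to count the numbers the model has to learn. A quick sketch (`n_embedding_params` is a hypothetical helper; the row counts are the ones used in this post):

```python
def n_embedding_params(n_user_rows, n_item_rows, n_factors):
    """Total number of values in the user and item embedding matrices."""
    return (n_user_rows + n_item_rows) * n_factors

for n_factors in (5, 50):
    total = n_embedding_params(944, 1665, n_factors)
    print(f"{n_factors:>3} factors -> {total:,} embedding parameters")
```

With 100,000 ratings, 5 factors means about 13 thousand parameters, while 50 factors means about 130 thousand, more parameters than we have ratings.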

Let's check the shapes of these tensors to confirm:

user_factor.shape
torch.Size([944, 5])

And for the movie factors:

movie_factor.shape
torch.Size([1665, 5])

These shapes confirm that we have 944 rows for users and 1,665 rows for movies, each represented by a 5-dimensional embedding vector.

Accessing Individual Embeddings

We can access the embedding for a specific user or movie by indexing into these tensors. For example, to get the embedding for the user at index 5:

torch.embedding(user_factor,torch.tensor(5))
tensor([-1.2018, -1.2946, -1.8869,  1.2259,  0.2970])

This shows the 5-dimensional embedding vector for user 5. These values are currently random, but after training, they will capture meaningful patterns about this user's preferences.

The Embedding Concept

These embedding vectors are at the heart of collaborative filtering. The idea is that after training:

  1. Users with similar tastes will have similar embedding vectors
  2. Movies with similar characteristics will have similar embedding vectors
  3. The dot product between a user's vector and a movie's vector will predict how much that user would like that movie

For example, if two users both enjoy sci-fi movies, their embedding vectors might have high values in similar dimensions. Similarly, if two movies are both action films, their embedding vectors might align in certain dimensions.

Using PyTorch's Embedding Layer

Rather than working with raw tensors, PyTorch provides a specialized nn.Embedding layer that's designed for exactly this use case. Let's create embedding layers for our users:

user_emb = nn.Embedding(n_users, 5)

This creates an embedding layer for our users with 5 dimensions. The advantage of using nn.Embedding over raw tensors is that it integrates with PyTorch's automatic differentiation, making it easy to train these embeddings using gradient descent.

Let's look at the embedding for user 10:

user_emb.weight[10]
tensor([-1.0398, -1.7286, -0.6798,  2.5864, -0.1015],
       grad_fn=<SelectBackward0>)

This shows the initial random values for user 10's embedding. Note the grad_fn attribute, which indicates that this tensor is part of PyTorch's computation graph and can be updated during training.

For comparison, let's look at our original manually created embedding tensor:

user_factor
tensor([[-1.0827,  0.2138,  0.9310, -0.2739, -0.4359],
        [-0.5195,  0.7613, -0.4365,  0.1365,  1.3300],
        [-1.2804,  0.0705,  0.6489, -1.2110,  1.8266],
        ...,
        [ 0.8009, -0.4734, -0.8962, -0.7348, -0.0246],
        [ 0.3354, -0.8262, -0.1541,  0.4699,  0.4873],
        [ 2.4054, -0.2156, -1.4126, -0.2467,  1.0571]])

And specifically for user 10:

user_factor[10]
tensor([-0.5753,  0.1556, -0.3694,  0.4986, -2.5438])

The values are different because each was randomly initialized.

Finally, let's inspect the embedding layer itself:

user_emb
Embedding(944, 5)

This confirms that our embedding layer is set up correctly with 944 rows (one per user class) and 5 dimensions per row.

Understanding PyTorch Embeddings

The nn.Embedding layer is essentially a lookup table. When we pass an index (like a user ID), it returns the corresponding row from its weight matrix. For example, when we access user_emb.weight[10], we're getting the 11th row (since indexing starts at 0) of the embedding weight matrix.

During training, these embedding weights will be updated to minimize our loss function. Users who rate similar movies similarly will end up with similar embedding vectors.

We'll put these embeddings to work in a complete model shortly. First, let's look at what an embedding lookup actually computes.

Understanding Embeddings as Matrix Operations

One important insight about embeddings is that they're equivalent to a specific matrix operation:

Multiplying a matrix by a one-hot-encoded vector is the same as looking up the row of that matrix at the corresponding index.

This helps us understand what's happening "under the hood" with embedding layers. Let's explore this with some examples:

one_hot(3,100)
tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0], dtype=torch.uint8)

The one_hot function creates a vector with all zeros except for a single 1 at the specified index. Here, we've created a one-hot vector of length 100 with a 1 at index 3.

Let's convert it to float type for matrix operations:

one_hot(3,100).float()
tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Now, if we had a matrix with 100 rows and multiplied its transpose by this one-hot vector, the result would be the 4th row of the matrix (index 3, since we start counting from 0).

This is exactly what an embedding lookup does. When we access:

user_emb.weight[10]
tensor([-1.0398, -1.7286, -0.6798,  2.5864, -0.1015],
       grad_fn=<SelectBackward0>)

It's equivalent to multiplying the embedding weight matrix by a one-hot vector with a 1 at index 10.

Let's look at our embedding layer again:

user_emb
Embedding(944, 5)

This shows we have an embedding matrix of shape (944, 5), where each row corresponds to a user and each column represents a dimension of our embedding space.

Why This Matters

Understanding embeddings as matrix operations helps us grasp what's happening during the forward and backward passes of training:

  1. Forward pass: When we look up an embedding, we're selecting a specific row from the embedding matrix.

  2. Backward pass: When we compute gradients, only the rows that were looked up in the forward pass receive a nonzero gradient; every other row of the matrix gets a gradient of zero.

This is why embeddings are so efficient for representing categorical variables with many possible values. We only need to compute and update the embeddings for the specific categories present in each batch.
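Here is a minimal sketch that makes the backward-pass point concrete. It uses a toy nn.Embedding with made-up sizes, looks up two rows, and inspects the gradient of the weight matrix:

```python
import torch
from torch import nn

torch.manual_seed(0)
emb = nn.Embedding(6, 3)            # 6 rows, 3 factors (toy sizes)

batch_idx = torch.tensor([1, 4])    # only rows 1 and 4 are looked up
emb(batch_idx).sum().backward()     # a stand-in "loss": the sum of the looked-up values

grad = emb.weight.grad
print(grad)
# Rows 1 and 4 contain ones; every other row is zero.
```

Only the rows that took part in the forward pass have a gradient to act on.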

We'll build a complete collaborative filtering model that uses these embedding layers to predict user ratings for movies shortly.

Exploring Embeddings in More Detail

Let's explore the embedding concept more thoroughly by examining the shapes and operations involved:

user_emb.weight.shape
torch.Size([944, 5])

This confirms that our embedding matrix has 944 rows (one per user) and 5 columns (the embedding dimension).

Now let's look at the shape of a one-hot encoded vector for user 10:

one_hot(10,n_users).float().shape
torch.Size([944])

This is a vector with 944 elements (one per user), with all zeros except for a 1 at index 10.

When we multiply the transpose of our embedding matrix by this one-hot vector, we get the embedding for user 10:

user_emb.weight.t() @ one_hot(10,n_users).float()
tensor([-1.0398, -1.7286, -0.6798,  2.5864, -0.1015], grad_fn=<MvBackward0>)

This is exactly the same as what we get when we directly access user_emb.weight[10]. This demonstrates that an embedding lookup is mathematically equivalent to multiplying by a one-hot vector.
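The same equivalence holds for a whole batch of indices at once. A small NumPy sketch with toy sizes and hypothetical variable names:

```python
import numpy as np

n_rows, n_factors = 6, 3
weights = np.arange(n_rows * n_factors, dtype=float).reshape(n_rows, n_factors)
idx = np.array([4, 0, 4])            # a "batch" of three indices

one_hots = np.eye(n_rows)[idx]       # shape (3, 6), a single 1 per row
via_matmul = one_hots @ weights      # shape (3, 3)
via_lookup = weights[idx]            # plain integer indexing

print(np.array_equal(via_matmul, via_lookup))  # True
```

The lookup gives the same answer without multiplying by all those zeros, which is why embedding layers index directly instead of building one-hot vectors.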

What is an Embedding?

An embedding is a learned mapping from discrete objects (like users or movies) to vectors of continuous numbers. In the context of collaborative filtering:

  1. Each user and movie is represented by a vector of floating-point numbers
  2. These vectors are learned during training to optimize a specific objective (like predicting ratings)
  3. The embedding vectors capture latent features that aren't explicitly provided in the data

Let's create another embedding layer, this time with fastai's Embedding class, to see how its initialization differs:

u_e = Embedding(944, 5)

And look at the embedding for user 10:

u_e.weight[10]
tensor([0.0063, 0.0057, 0.0053, 0.0052, 0.0039], grad_fn=<SelectBackward0>)

Notice that this is different from our previous embedding for user 10, and that the values are much smaller. Both layers start from random values, but fastai's Embedding draws them from a truncated normal distribution with a small standard deviation (0.01 by default), whereas nn.Embedding draws from a standard normal distribution.

Finally, let's get a batch of data to see what our model will receive during training:

batch = dls.one_batch()

This gives us a single batch from our dataloader, which contains:

  1. Input data: user IDs and movie IDs
  2. Target data: the corresponding ratings

Understanding the Batch Structure

When we retrieve a batch from our dataloader, it contains two main components:

  1. Inputs: A tensor of shape (batch_size, 2) where each row contains a user ID and a movie ID
  2. Targets: A tensor containing the rating that each user gave to the corresponding movie, one per row of the inputs

During training, our model will:

  1. Look up the embeddings for each user and movie in the batch
  2. Compute the dot product of each user-movie pair
  3. Compare the predicted ratings to the actual ratings
  4. Update the embeddings to minimize the prediction error

This is the essence of the collaborative filtering approach: learning embeddings that capture user preferences and movie characteristics in a way that allows us to predict how users will rate movies they haven't seen yet.
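The four steps above can be sketched as a bare-bones training loop in plain PyTorch. Everything here is a toy assumption for illustration (the eight made-up ratings, the embedding sizes, the Adam optimizer, and the learning rate); later we let fastai's Learner handle this for us.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: (user index, movie index) pairs and their ratings.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 2], [2, 1], [2, 2], [3, 0], [3, 3]])
y = torch.tensor([5.0, 3.0, 4.0, 1.0, 2.0, 5.0, 4.0, 2.0])

user_emb_toy = nn.Embedding(4, 3)
movie_emb_toy = nn.Embedding(4, 3)
params = list(user_emb_toy.parameters()) + list(movie_emb_toy.parameters())
opt = torch.optim.Adam(params, lr=0.05)

losses = []
for step in range(200):
    u = user_emb_toy(x[:, 0])            # 1. look up user embeddings
    m = movie_emb_toy(x[:, 1])           # 1. look up movie embeddings
    preds = (u * m).sum(dim=1)           # 2. dot product per user-movie pair
    loss = ((preds - y) ** 2).mean()     # 3. compare with the actual ratings
    opt.zero_grad()
    loss.backward()
    opt.step()                           # 4. update the embeddings
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss should drop substantially over the 200 steps as the embeddings move from random values toward ones that reproduce the toy ratings.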

In the next section, we'll build a complete neural network model that implements this approach.

Building Our Collaborative Filtering Model

Now that we understand embeddings, let's put everything together to build a complete collaborative filtering model. First, let's verify the dimensions of our user and movie spaces:

n_users, n_movies
(944, 1665)

This confirms we have 944 user classes and 1,665 movie classes in our dataset.

When we get a batch of data, it contains user and movie IDs. Let's examine the movie IDs in a batch:

batch[0][:,1]
tensor([1330,  899,  230, 1391,  334, 1133,  897,  466,  668,  102,  236, 1443,
         528,  320, 1247,  256,  769,  143,  271, 1397,  210, 1544, 1442,  529,
          17,  611, 1052,  485,  623, 1525,  938,  503, 1544,   65,  816, 1227,
          93,  499,  179, 1179,  588, 1019,  304,    5,  710,  457,  861, 1006,
         320,  578,  899,   62,  177,  279,  328, 1496,  570, 1252, 1216, 1402,
         884,  457,  738, 1121])

These are the movie indices in our batch. They are positions in dls.classes['title'], not the original MovieLens movie IDs. Each will be used to look up the corresponding movie embedding.

Let's create embeddings for users and movies and see what shape we get when we look up a batch:

us_em = Embedding(944, 5)
mo_em = Embedding(1665, 5)

batch = dls.one_batch()

x = batch[0][:, 0]
y = batch[0][:, 1]

print(us_em(x).shape, mo_em(y).shape)
torch.Size([64, 5]) torch.Size([64, 5])

This shows that for a batch of 64 examples, we get 64 user embeddings and 64 movie embeddings, each with 5 dimensions.

Implementing the Collaborative Filtering Model

Now let's implement our collaborative filtering model:

class CollabNN(nn.Module):
    "Simple collaborative filtering model with embeddings"
    def __init__(self, n_users, n_items, n_factors=5):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.item_factors = nn.Embedding(n_items, n_factors)
        
    def forward(self, x):
        users, items = x[:,0], x[:,1]
        u_embs = self.user_factors(users)
        i_embs = self.item_factors(items)
        return (u_embs * i_embs).sum(dim=1)

This model:

  1. Creates embedding layers for users and items
  2. In the forward pass, extracts user and item IDs from the input
  3. Looks up the corresponding embeddings
  4. Computes the element-wise product of user and item embeddings
  5. Sums along the embedding dimension to get a single prediction per user-item pair

Let's create an instance of our model and get a batch of data:

model = CollabNN(n_users, n_movies)
batch = dls.one_batch()

Now let's run the model on a batch to see what it predicts:

model(batch[0])
tensor([ 7.0029,  2.1804,  0.4648, -0.3704, -3.6570, -2.5748, -1.2354, -0.4618,
         0.8670, -0.7257, -1.3684,  0.6179,  0.6468,  2.9299,  1.4910, -0.4101,
         0.6326, -0.6567,  2.2093,  2.4966, -5.0130, -3.6183,  3.9021,  4.4451,
        -0.0432, -2.0280, -3.6852, -5.9757, -1.5701, -1.1312,  0.8875, -1.5192,
        -1.2604, -0.9187, -1.3469, -0.6555, -1.2011, -0.6149,  0.3042,  1.4095,
        -1.7217,  0.3008, -0.0148,  0.3080,  2.2792,  3.7195, -0.1592, -0.6061,
         1.7568, -0.7674,  0.2440, -0.9074, -1.0106, -3.1345,  0.0641,  1.2300,
         3.4579, -0.4415, -1.4399, -3.0345,  1.6182, -1.2363, -1.8696, -1.8537],
       grad_fn=<SumBackward1>)

Let's check the shape of these predictions:

model(batch[0]).shape
torch.Size([64])

We get 64 predictions, one for each user-item pair in our batch. Because the model is still untrained, these values are essentially random and are not yet confined to the 1-5 rating scale.

Training the Model

Now let's create a learner to train our model:

mdl = CollabNN(n_users, n_movies)
learner = Learner(dls, mdl, loss_func=MSELossFlat())

The Learner class from fastai takes care of the training loop, optimization, and other details. We use Mean Squared Error as our loss function, which is appropriate for regression tasks like rating prediction.
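As a reminder of what Mean Squared Error measures, here is a hand-rolled version on three made-up predictions. The `mse` helper is only for illustration; in training we use fastai's MSELossFlat.

```python
def mse(preds, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

preds   = [3.5, 4.0, 2.0]
targets = [4.0, 5.0, 1.0]

# Squared errors: 0.25, 1.0, 1.0 -> mean 0.75
print(mse(preds, targets))  # 0.75
```

Because the errors are squared, a prediction that is off by two stars costs four times as much as one that is off by one star.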

Let's train the model for 5 epochs using the one-cycle policy:

learner.fit_one_cycle(5, 5e-3)
epoch     train_loss  valid_loss  time    
0         16.511957   16.503958   00:05     
1         12.656537   13.187530   00:05     
2         4.729345    5.090328    00:06     
3         2.651412    3.003635    00:05     
4         2.324853    2.789970    00:05     

The training and validation losses decrease over time, indicating that our model is learning to predict ratings. By the end of training, we have a Mean Squared Error of around 2.3 on the training set and 2.8 on the validation set.

Finally, let's get a batch of data to use for making predictions:

x,y = dls.one_batch()

What We've Built

We've successfully built and trained a collaborative filtering model that can predict how users will rate movies they haven't seen yet. The model learns embeddings for both users and movies, capturing latent features that determine user preferences and movie characteristics.

The key steps were:

  1. Creating embedding layers for users and movies
  2. Building a neural network that computes the dot product of user and movie embeddings
  3. Training the model to minimize the prediction error on known ratings

This model can now be used to recommend movies to users by predicting ratings for movies they haven't seen and suggesting those with the highest predicted ratings.
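As a rough sketch of that recommendation step, the snippet below scores every movie for one user and keeps the best unseen ones. The random vectors are stand-ins for trained embeddings, and `recommend` is a hypothetical helper, not part of fastai.

```python
import numpy as np

rng = np.random.default_rng(0)
user_vecs = rng.normal(size=(3, 5))     # stand-ins for trained user embeddings
movie_vecs = rng.normal(size=(6, 5))    # stand-ins for trained movie embeddings

def recommend(user_idx, seen, k=2):
    """Indices of the k unseen movies with the highest dot-product score."""
    scores = movie_vecs @ user_vecs[user_idx]              # one score per movie
    scores[np.array(sorted(seen), dtype=int)] = -np.inf    # never recommend seen movies
    return np.argsort(scores)[::-1][:k].tolist()           # best scores first

top = recommend(user_idx=0, seen={1, 4})
print(top)
```

With a trained model, the indices would then be mapped back to titles through dls.classes['title'].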

Evaluating and Improving Our Model

Now that we've trained our basic collaborative filtering model, let's examine how well it's performing and explore ways to improve it.

The model we built has a simple structure: it computes the dot product between user and movie embeddings to predict ratings. While this works reasonably well, there are several improvements we can make.

Adding Bias Terms

One limitation of our current model is that it doesn't account for user and movie biases. Some users tend to give higher ratings overall, and some movies tend to receive higher ratings regardless of who's rating them.

Let's create an improved model that includes bias terms:

class CollabNN(nn.Module):
    "Collaborative filtering model with embeddings and bias terms"
    def __init__(self, n_users, n_items, n_factors=5, y_range=(0,5.5)):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.item_factors = nn.Embedding(n_items, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users, items = x[:,0], x[:,1]
        u_embs = self.user_factors(users)
        i_embs = self.item_factors(items)
        u_bias = self.user_bias(users).squeeze()
        i_bias = self.item_bias(items).squeeze()
        dot = (u_embs * i_embs).sum(dim=1)
        return sigmoid_range((dot + u_bias + i_bias), *self.y_range)

This model adds several improvements:

  1. User and Item Biases: Each user and movie has a bias term that captures their general rating tendencies.

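  2. Output Range: We use sigmoid_range to constrain the output to a specific range (0 to 5.5 in this case). The range is deliberately a little wider than the 1-5 rating scale: a sigmoid only approaches its limits, so an upper bound of exactly 5 would make a 5-star prediction unreachable.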
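To see what the output range does, here is a scalar sketch of the same idea: a sigmoid rescaled to the interval between low and high. `sigmoid_range_scalar` is a hypothetical stand-in written with the standard library; the model itself uses fastai's tensor version.

```python
import math

def sigmoid_range_scalar(x, low, high):
    """Squash any real number x into the open interval (low, high)."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low

for raw in (-10, 0, math.log(10), 10):
    print(f"{raw:6.2f} -> {sigmoid_range_scalar(raw, 0, 5.5):.3f}")
```

A raw score of 0 maps to the middle of the range (2.75), and with an upper bound of 5.5 a modest raw score of ln(10), about 2.3, is enough to predict exactly 5 stars. With an upper bound of 5, no finite raw score could get there.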

Let's train this improved model:

mdl = CollabNN(n_users, n_movies)
learner = Learner(dls, mdl, loss_func=MSELossFlat())
learner.fit_one_cycle(5, 5e-3)
epoch     train_loss  valid_loss  time    
0         3.944084    3.918710    00:06     
1         1.731161    1.895458    00:06     
2         1.111672    1.324167    00:06     
3         1.025903    1.186654    00:06     
4         0.962344    1.169547    00:06 

The results are much better! Our validation loss has decreased from 2.79 to 1.17, which is a substantial improvement. This shows that adding bias terms and constraining the output range has helped our model make more accurate predictions.

Let's visualize the training progress:

def get_training_losses(learner):
    "Display training and validation losses from a fastai Learner as a DataFrame."
    losses = learner.recorder.values
    train_losses = [x[0] for x in losses]
    valid_losses = [x[1] for x in losses]
    
    return pd.DataFrame({
        'Epoch': range(1, len(train_losses)+1),
        'Training Loss': train_losses,
        'Validation Loss': valid_losses
    })
get_training_losses(learner)
Epoch Training Loss Validation Loss
0 1 3.944084 3.918710
1 2 1.731161 1.895458
2 3 1.111672 1.324167
3 4 1.025903 1.186654
4 5 0.962344 1.169547

The loss decreases rapidly in the first few epochs and then continues to improve more gradually. This is a typical learning curve for neural networks.

Understanding the Model's Predictions

Our model now predicts ratings in the range of 0 to 5.5. The Mean Squared Error of about 1.17 on the validation set corresponds to a root mean squared error (RMSE) of about 1.08 stars (the square root of the MSE), which is a rough measure of the typical size of our prediction errors.

This is a reasonable result for such a simple collaborative filtering model. It means that if a user would rate a movie 4 stars, our model will typically predict something between 3 and 5 stars.

Further Enhancing Our Model

There's one more enhancement we can make to our collaborative filtering model: adding a global bias term.

class ModifiedCollabNN(nn.Module):
    "Collaborative filtering model with embeddings, user/item biases, and a global bias"
    def __init__(self, n_users, n_items, n_factors=5, y_range=(0,5.5)):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.item_factors = nn.Embedding(n_items, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)
        self.bias = nn.Parameter(torch.zeros(1))
        self.y_range = y_range
        
    def forward(self, x):
        users, items = x[:,0], x[:,1]
        u_embs = self.user_factors(users)
        i_embs = self.item_factors(items)
        u_bias = self.user_bias(users).squeeze()
        i_bias = self.item_bias(items).squeeze()
        dot = (u_embs * i_embs).sum(dim=1)
        return sigmoid_range(dot + u_bias + i_bias + self.bias, *self.y_range)

This model adds a global bias term (self.bias) that captures the overall average rating in the dataset. This gives us a complete model with:

  1. User embeddings
  2. Movie embeddings
  3. User bias terms
  4. Movie bias terms
  5. Global bias term
  6. Output range constraint

Let's train this enhanced model:

mdl = ModifiedCollabNN(n_users, n_movies)
learner = Learner(dls, mdl, loss_func=MSELossFlat())
learner.fit_one_cycle(5, 5e-3, wd=0.1)

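The training results show that the model converges to a validation loss of around 0.87, which is even better than our previous model. Note that this run changes two things at once: it adds the global bias and it uses weight decay (wd=0.1), which helps prevent overfitting by regularizing the model parameters. The improvement therefore cannot be attributed to either change alone.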
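For intuition, here is what weight decay does in the simplest setting: plain gradient descent on a single parameter. This is only a sketch. The training above goes through fastai's default optimizer, whose update rule is more involved, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(param, grad, lr=0.1, wd=0.0):
    """One gradient-descent step; weight decay adds a pull toward zero."""
    return param - lr * (grad + wd * param)

p, g = 2.0, 0.5
print(sgd_step(p, g))           # about 1.95: no weight decay
print(sgd_step(p, g, wd=0.1))   # about 1.93: shrunk a little further toward zero
```

The extra pull is proportional to the size of the parameter, so large embedding values are discouraged unless the data really supports them.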

Let's visualize the training progress:

losses = learner.recorder.values

# Get training and validation losses (first two columns of values)
train_losses = [x[0] for x in losses]
valid_losses = [x[1] for x in losses]

# Create a dataframe to display them
pd.DataFrame({
    'Epoch': range(1, len(train_losses)+1),
    'Training Loss': train_losses,
    'Validation Loss': valid_losses
})
Epoch Training Loss Validation Loss
0 1 0.872213 1.016872
1 2 0.806509 0.916550
2 3 0.790299 0.886949
3 4 0.777781 0.872037
4 5 0.759944 0.869526

Comparing the Models

Let's compare the three models we've built:

  1. Basic Dot Product Model:

    • Simple dot product between user and movie embeddings
    • Final validation loss: ~2.79
  2. Model with Bias Terms and Output Range:

    • User and movie embeddings
    • User and movie bias terms
    • Output range constraint
    • Final validation loss: ~1.17
  3. Enhanced Model with Global Bias and Weight Decay:

    • User and movie embeddings
    • User and movie bias terms
    • Global bias term
    • Output range constraint
    • Weight decay regularization
    • Final validation loss: ~0.87

Each step has improved the model's performance, reducing the validation loss by more than 3x (from ~2.79 to ~0.87) between the basic model and the enhanced model.
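Since MSE is in squared stars, it can be easier to compare the models by their root mean squared error. A quick check of the arithmetic, using the final validation losses reported above:

```python
import math

valid_mse = {
    "basic dot product": 2.789970,
    "with biases and output range": 1.169547,
    "with global bias and weight decay": 0.869526,
}

for name, mse_value in valid_mse.items():
    print(f"{name:<36} MSE {mse_value:.2f}  RMSE {math.sqrt(mse_value):.2f}")

ratio = valid_mse["basic dot product"] / valid_mse["with global bias and weight decay"]
print(f"MSE reduction from first to last model: {ratio:.1f}x")
```

In RMSE terms the models go from roughly 1.67 stars to 1.08 stars to 0.93 stars.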

Conclusion

In this tutorial, we've built increasingly sophisticated collaborative filtering models for movie recommendations. We started with a simple dot product model and progressively added bias terms, output constraints, and regularization to improve performance.

The key takeaways are:

  1. Embeddings: The foundation of collaborative filtering is learning meaningful embeddings for users and items.

  2. Bias Terms: Adding bias terms significantly improves performance by capturing user and item tendencies.

  3. Output Constraints: Ensuring predictions are within a valid range improves the model's practicality.

  4. Regularization: Weight decay helps prevent overfitting and improves generalization.

The final model achieves a Mean Squared Error of around 0.87 on the validation set, which corresponds to an RMSE of about 0.93 stars (the square root of the MSE). This is quite good for such a simple collaborative filtering model and makes it a reasonable starting point for a real-world recommendation system.

This approach can be extended in various ways, such as incorporating additional features, using more complex architectures, or applying different regularization techniques. However, even this relatively simple model provides strong performance and demonstrates the power of collaborative filtering for recommendation systems.


© 2024 Gaurav. All rights reserved.