Collaborative filtering is a technique that predicts user behavior from historical user data, which makes it a natural basis for recommendation systems. For example, Amazon recommends products or offers discounts based on historical user data, and YouTube recommends videos based on your viewing history.
Why do we need this though?
We want to encourage user engagement by predicting what the user may enjoy or need, so that the user stays hooked on our service. Unique algorithms have been designed for services like Netflix, Spotify, and Amazon to give recommendations based on the user’s preferences and to filter out redundant and irrelevant information.
There are three types of recommendation systems:
- Content-based filtering: We use the user’s own historical activity data (such as which movies they like and dislike) to predict which similar products the user is going to like.
- Collaborative filtering: In this technique, we use the historical preference data of other users (hence the word collaborative) to predict what a particular user may like. Say many users who have watched the movie Iron Man have also watched Avengers. Our system will then recommend Avengers to a user who has only watched Iron Man.
- Hybrid model: As the name suggests, this system is a combination of the above two models.
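The Iron Man and Avengers example from the collaborative filtering bullet can be sketched in a few lines. This is only a minimal illustration: the watch histories, the user names, and the helper `recommend_by_co_watching` are all made up, and real systems use far more refined scoring than raw co-watch counts.

```python
from collections import Counter

# Hypothetical watch histories: user -> set of movies watched (made-up data).
watch_history = {
    "alice": {"Iron Man", "Avengers"},
    "bob": {"Iron Man", "Avengers", "Thor"},
    "carol": {"Iron Man", "Avengers"},
    "dave": {"Iron Man"},
}


def recommend_by_co_watching(target_user, history, top_n=1):
    """Suggest the movies most often watched by users who overlap with target_user."""
    seen = history[target_user]
    counts = Counter()
    for user, movies in history.items():
        if user == target_user or not (movies & seen):
            continue  # skip the target and users with no movie in common
        counts.update(movies - seen)  # count only movies the target has not seen
    return [movie for movie, _ in counts.most_common(top_n)]


print(recommend_by_co_watching("dave", watch_history))  # ['Avengers']
```

Here dave has only watched Iron Man, and Avengers is the title most often co-watched by the other Iron Man viewers, so it is the one suggested.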
Let us explore more about Collaborative Filtering! There are two types of collaborative filtering:
- Memory-based collaborative filtering
- Model-based collaborative filtering
Memory-based collaborative filtering
We use the users’ rating data to compute the similarity between users or between the items that the users have rated.
In the illustrated example, the system identifies users whose preferences are similar to those of the third user. Since the third user has not watched Films 3 and 4, we aggregate the ratings that these similar users gave to the two films, and Film 4 comes out as the one to recommend. With a large dataset, however, we need a well-defined method to decide which item to recommend. In a user-based approach, the rating of item i for user u is predicted from the ratings that a set of k similar users gave to item i.
In an item-based system, the rating is predicted analogously from the ratings that user u gave to items similar to i.
In both cases, the ratings are typically weighted by a similarity score such as s(u,v), the similarity between users u and v.
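Several variants of these prediction formulas exist, so the following is one common notation rather than a definitive form. It assumes that N_k(u) denotes the k users most similar to u who have rated item i, that N_k(i) denotes the k items most similar to i that user u has rated, and that a hat marks a predicted rating. The first line is a plain average, the second a similarity-weighted user-based prediction, and the third its item-based counterpart.

```latex
\begin{align}
\hat{r}_{ui} &= \frac{1}{k} \sum_{v \in N_k(u)} r_{vi} \\
\hat{r}_{ui} &= \frac{\sum_{v \in N_k(u)} s(u,v)\, r_{vi}}{\sum_{v \in N_k(u)} \lvert s(u,v) \rvert} \\
\hat{r}_{ui} &= \frac{\sum_{j \in N_k(i)} s(i,j)\, r_{uj}}{\sum_{j \in N_k(i)} \lvert s(i,j) \rvert}
\end{align}
```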
In neighborhood-based algorithms, we calculate the similarity between two users or between two items.
We use either Pearson correlation (see our post) or cosine similarity to calculate it.
For two users u and v, the Pearson correlation is computed over the items that both have rated, after subtracting each user’s mean rating from their ratings.
The cosine similarity of users u and v treats each user’s ratings as a vector and takes the cosine of the angle between the two vectors.
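Written out, the two similarity measures take the following shape. The notation is an assumption: I_uv stands for the set of items rated by both u and v, and a bar marks a user's mean rating. Conventions differ on whether the sums run over co-rated items or over all items, so treat this as one common variant. Pearson correlation comes first, cosine similarity second.

```latex
\begin{align}
s(u,v) &= \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}
               {\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2}\,
                \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}} \\
s(u,v) &= \frac{\sum_{i \in I_{uv}} r_{ui}\, r_{vi}}
               {\sqrt{\sum_{i \in I_{uv}} r_{ui}^2}\,
                \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}
\end{align}
```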
That is how memory-based CF works in general. It comes in two specific types:
- User-based: This approach predicts a rating from the rating information collected from similar users. It is dynamic, because user similarities shift as new ratings arrive, so predictions are hard to precompute.
- Item-based: This approach works like the user-based one, except that it relies on item similarity rather than user similarity. It is comparatively static, so similar items can be precomputed for recommendation.
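A user-based prediction can be sketched with NumPy as follows. The rating matrix is made up (the third user has not rated Films 3 and 4, mirroring the example above), and the helper names `cosine_similarity` and `predict_user_based` are hypothetical. The sketch assumes cosine similarity over co-rated films and a similarity-weighted average over the k nearest users, which is only one of several reasonable choices.

```python
import numpy as np

# Made-up ratings: rows are users, columns are films; np.nan marks "not rated".
ratings = np.array([
    [5.0, 4.0, 1.0, 5.0],
    [4.0, 5.0, 2.0, 4.0],
    [5.0, 4.0, np.nan, np.nan],
    [1.0, 2.0, 5.0, 1.0],
])


def cosine_similarity(a, b):
    """Cosine similarity computed over the films both users have rated."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return 0.0
    x, y = a[both], b[both]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0


def predict_user_based(ratings, user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    candidates = [
        (cosine_similarity(ratings[user], ratings[v]), ratings[v, item])
        for v in range(ratings.shape[0])
        if v != user and not np.isnan(ratings[v, item])
    ]
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(abs(s) for s, _ in top)
    if weight == 0:
        return float("nan")
    return sum(s * r for s, r in top) / weight


film3 = predict_user_based(ratings, user=2, item=2)
film4 = predict_user_based(ratings, user=2, item=3)
print(film3, film4)
```

With this toy data the predicted rating for Film 4 (about 4.5) is well above the one for Film 3 (about 1.5), so Film 4 would be recommended. An item-based variant would compare columns instead of rows, and the column similarities could be computed ahead of time.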
Model-Based Collaborative Filtering
In this approach, we develop models using different machine learning algorithms and train them on the user and rating data. Algorithms like neural networks, Bayesian networks, and clustering approaches are used. You can see our post for an explanation of different machine learning models. The best-known example of this approach is matrix factorization.
If there is feedback from the user, for example a rating given after watching a particular movie or reading a particular book, it can be represented as a matrix in which each row represents a particular user and each column represents a particular item. Since it is almost impossible for a user to rate every item, this matrix will have many unfilled values. This is called sparsity. Matrix factorization methods find a set of latent factors and determine user preferences using these factors. This latent information is inferred by analyzing user behavior. The latent factors are otherwise called features.
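The user-item matrix and its sparsity can be illustrated with a small sketch. The feedback triples below are made up, and the variable names are hypothetical. Sparsity is taken here to mean the fraction of cells without a rating.

```python
import numpy as np

# Hypothetical feedback log of (user, item, rating) triples (made-up data).
feedback = [
    ("u1", "book_a", 5), ("u1", "book_c", 3),
    ("u2", "book_b", 4),
    ("u3", "book_a", 2), ("u3", "book_d", 5),
]

users = sorted({u for u, _, _ in feedback})
items = sorted({i for _, i, _ in feedback})
user_index = {u: n for n, u in enumerate(users)}
item_index = {i: n for n, i in enumerate(items)}

# Rows are users, columns are items; np.nan marks an unfilled (unrated) cell.
rating_matrix = np.full((len(users), len(items)), np.nan)
for u, i, r in feedback:
    rating_matrix[user_index[u], item_index[i]] = r

# Sparsity: fraction of cells that hold no rating.
sparsity = np.isnan(rating_matrix).mean()
print(rating_matrix.shape, round(float(sparsity), 3))  # (3, 4) 0.583
```

Even in this tiny example more than half of the cells are empty. In real catalogs with many thousands of items, the share of empty cells is usually far higher.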
The rating matrix is approximated as the product of two smaller matrices: the user-feature matrix and the item-feature matrix.
Matrix factorization steps:
- Initialize the user and item matrices with random values
- Obtain the predicted rating matrix by multiplying the user matrix with the transposed item matrix
- Minimize the loss function, so that the difference between the predicted and the actual ratings is as small as possible. Each predicted rating is the dot product of a row in the user matrix and a column in the transposed item matrix.
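A commonly used loss for this step is the regularized squared error over the known ratings. The form below is an assumption about the intended formula, and the symbols p_u and q_i are introduced here for the latent vectors of user u and item i (the corresponding rows of the user and item matrices).

```latex
\min_{p,\,q} \; \sum_{(u,i) \in K} \left( r(u,i) - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```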
In the loss function, K is the set of (u, i) pairs for which the rating is known, r(u, i) is the rating of item i by user u, and λ is the regularization coefficient (used to avoid overfitting).
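The three steps can be sketched with plain stochastic gradient descent in NumPy. This is a minimal illustration under stated assumptions, not a production method: the function name `factorize`, the hyperparameters (`n_factors`, `lr`, `reg`, `n_epochs`), and the toy rating matrix are all made up, and the loss is assumed to be squared error on the known ratings plus an L2 penalty.

```python
import numpy as np


def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=2000, seed=0):
    """Fit user and item factor matrices by SGD, using only the known ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    # Step 1: random initialization of the user and item matrices.
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    # K: the (user, item) pairs with a known rating.
    known = [(u, i) for u in range(n_users) for i in range(n_items)
             if not np.isnan(ratings[u, i])]
    for _ in range(n_epochs):
        for u, i in known:
            p_u = P[u].copy()
            err = ratings[u, i] - p_u @ Q[i]  # error on one known rating
            # Step 3: gradient step on squared error plus L2 regularization.
            P[u] += lr * (err * Q[i] - reg * p_u)
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q


# Made-up ratings; np.nan marks the cells to be predicted.
ratings = np.array([
    [5.0, 4.0, 1.0, 5.0],
    [4.0, 5.0, 2.0, 4.0],
    [5.0, 4.0, np.nan, np.nan],
    [1.0, 2.0, 5.0, 1.0],
])

P, Q = factorize(ratings)
predicted = P @ Q.T  # Step 2: user matrix times transposed item matrix
mask = ~np.isnan(ratings)
rmse = np.sqrt(np.mean((ratings[mask] - predicted[mask]) ** 2))
print(np.round(predicted, 2), rmse)
```

The entries of `predicted` that correspond to unfilled cells of `ratings` are the model's estimates for items the user has not rated yet. Libraries built for recommendation typically use more careful optimizers and tune the number of factors and the regularization strength on held-out ratings.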
Challenges for Collaborative Filtering
- We require a huge dataset
- No data is available for new users (the cold-start problem)
- Only a few recommendations can be proposed when little rating data is available
- It is difficult to find users who have rated the same items (also known as sparsity)
- It cannot recommend items to users with unique tastes
- Traditional algorithms cannot work properly with a large number of users and items (scalability)
- It mostly recommends popular products
Advantages of Collaborative Filtering
- It is a general technique that works from ratings alone, without knowledge of the items’ content
- The approach helps us capture the user’s change in preferences
- It gives us a tentative rating of an item even before the user has purchased it
Some links in case you want to learn more about the topic:
Singh, Pradeep & Dutta Pramanik, Pijush & Choudhury, Prasenjit. (2020). Collaborative Filtering in Recommender Systems: Technicalities, Challenges, Applications, and Research Trends. doi:10.1201/9781003007210-8.