A Guide to Content-Based Filtering In Recommender Systems

Aug 5, 2022•7 min read

Recommender systems play a crucial role in our online lives. Popular media and service platforms such as Netflix, YouTube, Amazon, and Facebook spend a significant share of their revenue on delivering personalized adverts and recommendations that drive sales and engagement. Users and customers benefit as well: they can buy products suited to their tastes and discover new, relevant ones. Two main types of recommender systems are used for this: collaborative filtering and content-based filtering. In this article, we’ll look at both, with a focus on content-based filtering.

Importance of using recommender systems

When there are numerous options to choose from, it’s natural to be confused, whether it’s selecting a flavor of ice cream or a model of headphones. Recommendation systems help by filtering out the options that do not align with our taste or past behavior. The more purchasing history and behavioral data they have access to, the more accurate the recommendations become.

A downside to this approach is a lack of good suggestions for new customers, since the system has no previous data on their habits. To tackle this situation, other methods can be used, such as explicitly asking the customer what type of content they want to view or suggesting items that are popular in their geographical location or age group.

There are two types of recommender systems:

Collaborative filtering
Content-based filtering

Collaborative filtering

Collaborative filtering-based recommender systems rely solely on past interactions between users and items to suggest new products. The features of the individual items are not considered.

In collaborative filtering, the historical data of users interacting with items is recorded and stored. This is usually represented by a matrix known as the user-item interaction matrix, where rows represent users and columns represent items. Similar users are grouped, and all their interactions are considered when making recommendations to the target user.

Collaborative filtering can be subdivided into two approaches: memory-based and model-based.

Memory-based collaborative approach

The memory-based approach relies solely on the user-item interaction matrix and similarity calculations to find nearest neighbors and suggest new items. No machine learning (ML) models are trained.
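
As a minimal sketch of the memory-based idea, the snippet below finds the user most similar to a target user and suggests items that neighbor has rated but the target has not. The ratings, the user and movie names, and the choice to compute cosine similarity only over items both users have rated are all assumptions made for illustration.

```python
import math

# Hypothetical user-item interaction matrix (ratings 1-5). A missing key
# means the user has not interacted with that item.
ratings = {
    "alice": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob":   {"movie_a": 5, "movie_b": 5, "movie_d": 4},
    "carol": {"movie_a": 1, "movie_c": 5, "movie_e": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated (a simplification)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(target, k=1):
    """Suggest unseen items rated by the k users most similar to the target."""
    others = [(cosine_similarity(ratings[target], ratings[name]), name)
              for name in ratings if name != target]
    neighbours = sorted(others, reverse=True)[:k]
    seen = set(ratings[target])
    suggestions = {}
    for _, name in neighbours:
        for item, score in ratings[name].items():
            if item not in seen:
                suggestions[item] = max(score, suggestions.get(item, 0))
    return sorted(suggestions, key=suggestions.get, reverse=True)

print(recommend("alice"))
```

Here the nearest neighbor of "alice" is "bob", so the only suggestion is the one movie bob has rated that alice has not seen. No model is fitted; everything is computed directly from the stored ratings.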

Model-based collaborative approach

An underlying model is assumed to explain the user-item interactions. This model is then fitted to the observed data and used to score items the user has not interacted with yet. Items with a higher compatibility score are recommended to the user.
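
One possible underlying model is matrix factorization, shown below as a rough sketch; the article does not name a specific model, so this choice is an assumption. Each user and item gets a small vector of latent factors, and the vectors are adjusted by stochastic gradient descent until their dot products approximate the observed ratings. The ratings, learning rate, and regularization values are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical observed ratings: (user index, item index, rating).
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2)]
n_users, n_items, n_factors = 3, 3, 2

# Small random starting values for the latent factors.
P = [[random.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]

def predict(u, i):
    """Compatibility score: dot product of user and item factors."""
    return sum(P[u][f] * Q[i][f] for f in range(n_factors))

def squared_error():
    """Total squared error over the observed ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in observed)

error_before = squared_error()

learning_rate, regularization = 0.01, 0.02  # assumed values
for _ in range(2000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(n_factors):
            p, q = P[u][f], Q[i][f]
            P[u][f] += learning_rate * (err * q - regularization * p)
            Q[i][f] += learning_rate * (err * p - regularization * q)

error_after = squared_error()

# Score a pair the user has not interacted with yet (user 0, item 2).
print(error_before, error_after, predict(0, 2))
```

After fitting, `predict` can be called for every item a user has not rated, and the highest-scoring items become the recommendations.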

Content-based filtering

Content-based filtering in recommender systems leverages machine learning algorithms to recommend new items that are similar to those the user has already liked. Recommending products based on their characteristics is only possible if there is a clear set of features for each product and a record of the user’s choices.

The recommender system stores previous user data like clicks, ratings, and likes to create a user profile. The more a customer engages, the more accurate future recommendations are.

To understand this, let’s use a simple example of how a content-based recommender system might work to suggest movies.

Let’s suppose there are four movies and a user has seen and liked the first two.

[Figure: Pictorial representation of content-based filtering in a recommender system]

The model automatically suggests the third movie rather than the fourth, since it is more similar to the first two. This similarity can be calculated based on a number of features like the actors and actresses in the movie, the director, the genre, the duration of the film, etc.
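
The four-movie example can be sketched as follows. The feature sets are invented for illustration, and Jaccard similarity (the share of features two movies have in common) is just one simple way to measure overlap; it is an assumption, not a method named in this article.

```python
# Hypothetical feature sets for four movies (genre, director, lead actor).
movies = {
    "movie_1": {"action", "director_x", "actor_p"},
    "movie_2": {"action", "sci-fi", "director_x"},
    "movie_3": {"action", "sci-fi", "actor_p"},
    "movie_4": {"romance", "director_y", "actor_q"},
}
liked = ["movie_1", "movie_2"]  # the user has seen and liked the first two

def jaccard(a, b):
    """Share of features two movies have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def score(candidate):
    """Average similarity between a candidate and the movies the user liked."""
    return sum(jaccard(movies[candidate], movies[m]) for m in liked) / len(liked)

candidates = [m for m in movies if m not in liked]
ranked = sorted(candidates, key=score, reverse=True)
print(ranked)
```

With these made-up features, the third movie shares half of its features with each liked movie while the fourth shares none, so the third is ranked first.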

Important terms

Utility matrix

A utility matrix records the interactions between users and items. Data gathered from the day-to-day activities of each user is saved in a structured format to capture their likes and dislikes for the items they have interacted with. A value is assigned to every interaction, known as the ‘degree of preference’.

[Figure: Example of content-based filtering]

A few values are missing in the above example of a utility matrix. This is because some users do not interact with every item available on the platform. Note that the goal of the recommender model is to suggest new items based on this utility matrix.
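
A utility matrix with gaps might be represented as in the sketch below. The users, items, and preference values are hypothetical, and `None` is used here to mark an item the user has not interacted with.

```python
# Hypothetical utility matrix: degree of preference on a 1-5 scale.
# None means the user has not interacted with the item.
items = ["item_a", "item_b", "item_c", "item_d"]
utility = {
    "user_1": [5, None, 3, None],
    "user_2": [None, 4, None, 1],
    "user_3": [2, None, None, 5],
}

def missing_items(user):
    """Items with no recorded preference: the candidates for recommendation."""
    return [item for item, value in zip(items, utility[user]) if value is None]

total = len(items) * len(utility)
filled = sum(value is not None for row in utility.values() for value in row)

print(missing_items("user_1"))
print(filled / total)  # fraction of the matrix that is filled in
```

The recommender’s job is to estimate a value for each empty cell and suggest the items with the highest estimates.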

User profile

A user profile is the collection of vectors that define a user’s preferences. The profile is based on the activities and tastes of the user, for example, ratings, number of clicks on different items, and thumbs up or thumbs down on content. This information helps the recommender engine make better-targeted suggestions.
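
One simple way to turn ratings into a profile vector is sketched below: each item’s feature vector is weighted by how far the user’s rating sits from a neutral score, and the weighted vectors are summed. The feature order, the ratings, and the neutral value of 3 are assumptions for illustration.

```python
# Hypothetical item vectors over the features [action, horror, comedy].
item_vectors = {
    "movie_a": [1, 0, 0],
    "movie_b": [1, 0, 1],
    "movie_c": [0, 1, 0],
}
# Hypothetical ratings from one user on a 1-5 scale.
user_ratings = {"movie_a": 5, "movie_b": 4, "movie_c": 1}

def build_profile(ratings, vectors, neutral=3):
    """Sum item vectors, each weighted by how far its rating is from neutral."""
    size = len(next(iter(vectors.values())))
    profile = [0.0] * size
    for item, rating in ratings.items():
        weight = rating - neutral  # positive for likes, negative for dislikes
        for k, value in enumerate(vectors[item]):
            profile[k] += weight * value
    return profile

profile = build_profile(user_ratings, item_vectors)
print(profile)
```

The resulting profile is positive for action, negative for horror, and mildly positive for comedy, which matches the ratings it was built from.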

Item profile

For content-based filtering, we require the different features of every individual item to represent their essential qualities. Going back to the movie example, some necessary attributes of movies that will help the recommender system distinguish between them are actors and actresses, director, year of release, genre, IMDb ratings, etc.
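
As an illustrative sketch, raw movie attributes can be encoded into a numeric item profile like this. The movies, the genre vocabulary, and the year range used for scaling are all hypothetical, and a real system would likely use many more features.

```python
# Hypothetical raw attributes for two movies.
raw_items = {
    "movie_a": {"genres": ["action", "sci-fi"], "year": 2019},
    "movie_b": {"genres": ["horror"], "year": 2005},
}

# Vocabulary of genres observed in the catalog, in a fixed order.
genre_vocab = sorted({g for attrs in raw_items.values() for g in attrs["genres"]})

def item_profile(attrs, min_year=2000, max_year=2020):
    """Multi-hot genre flags followed by the release year scaled to [0, 1]."""
    genre_flags = [1.0 if g in attrs["genres"] else 0.0 for g in genre_vocab]
    year_scaled = (attrs["year"] - min_year) / (max_year - min_year)
    return genre_flags + [year_scaled]

profiles = {name: item_profile(attrs) for name, attrs in raw_items.items()}
print(genre_vocab)
print(profiles)
```

Each movie ends up as a fixed-length vector, so any two movies (or a movie and a user profile built over the same features) can be compared numerically.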

There are generally two popular methods used in content-based filtering: cosine distance and classification approach.

Cosine distance

Here, the cosine distance between the user and item vectors is used to determine preference. Let’s understand this with an example: our target user enjoys watching action movies and somewhat dislikes horror and thrillers. In that user’s vector, the component for action is positive and the components for horror and thrillers are negative.

Now, consider a new movie released in the sci-fi action genre. Since our user prefers action movies, the cosine of the angle between the movie vector and the user vector will be a large positive fraction. That corresponds to a small angle and therefore a small cosine distance, which means it's a good recommendation for our user. If the cosine distance is large, we generally ignore the item since it's a bad recommendation.
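
The comparison can be sketched in a few lines, taking cosine distance as 1 minus the cosine of the angle between two vectors. The feature order and all vector values below are assumptions chosen to mirror the example of a user who likes action and dislikes horror.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); smaller means better aligned."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Assumed feature order: [action, horror, sci-fi].
user = [3.0, -2.0, 1.0]               # likes action, dislikes horror
scifi_action_movie = [1.0, 0.0, 1.0]  # the new release
horror_movie = [0.0, 1.0, 0.0]

d_scifi = cosine_distance(user, scifi_action_movie)
d_horror = cosine_distance(user, horror_movie)
print(d_scifi, d_horror)
```

The sci-fi action movie comes out with a distance well below 1 (a small angle), while the horror movie comes out above 1 because its vector points away from the user’s preferences.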

Classification approach

Classification algorithms like Bayesian classifiers or decision tree models can be used to make recommendations. For example, each level of a decision tree can test one item feature against the user’s preferences, narrowing the candidates down to a more refined choice.
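
As a rough sketch of the Bayesian option, the snippet below trains a small naive Bayes classifier on binary item features to predict whether a user will like an item. The training data and feature order are made up, and the naive Bayes variant with Laplace smoothing is one assumed choice among many possible classifiers.

```python
import math

# Hypothetical training data: binary item features [action, horror, comedy]
# and whether the user liked the item (1) or not (0).
features = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 0]]
liked = [1, 1, 0, 0, 1]

def train(features, labels):
    """Estimate P(class) and P(feature=1 | class) with Laplace smoothing."""
    model = {}
    for label in set(labels):
        rows = [x for x, y in zip(features, labels) if y == label]
        prior = len(rows) / len(labels)
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior for item x."""
    def log_posterior(label):
        prior, probs = model[label]
        return math.log(prior) + sum(
            math.log(p if value else 1 - p) for p, value in zip(probs, x))
    return max(model, key=log_posterior)

model = train(features, liked)
print(predict(model, [1, 0, 1]))  # an action comedy
print(predict(model, [0, 1, 0]))  # a horror movie
```

Items predicted as class 1 would be candidates for recommendation; items predicted as class 0 would be filtered out.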

Content-based filtering: Advantages and disadvantages

Advantages

It is easily scalable to a large number of customers since the data of other users is not required for recommending something to a particular user.
Since the recommendations are based on the day-to-day activities of the user, all the preferences and parameters of the suggestions are finely tuned to the user’s choice. Therefore, the model can recommend specific niche items that other users might not be interested in.
The latest items can be suggested as soon as they are launched, without waiting for ratings or interactions from other users to accumulate, since the item features are available from the start.

Disadvantages

Building a content-based recommender engine requires a lot of domain knowledge since the feature selection of the items is mostly hard-coded into the system. Thus, the model is only as good as the knowledge of the one building it.
The model can only recommend new items based on the present interests of the user. Hence, discovering and expanding to newer avenues that might interest the user is not possible.
The cold start problem is a significant drawback since the engine does not have sufficient information about a new user to start making suggestions.
It is hard to make new recommendations to not-so-active users.

Collaborative filtering vs content-based filtering for recommender systems

Content-based filtering methods require a fair amount of information about an item’s features, rather than about how other users have interacted with it. For products like clothes, these features can be size, color, brand, material, etc., or in the case of movies, actors, genre, director, year of release, etc.
Collaborative filtering, on the other hand, uses historical interactions between users and items to group users with similar tastes and suggest to the target user new items that are popular within the group.
Content-based filtering models are heavily based on domain knowledge since the item features are hand-engineered into the system. Collaborative filtering does not need such in-depth domain knowledge since all the embeddings are automatically learned.
Collaborative filtering systems require only the user behavior data, whereas content-based methods require both user and item data.

In this article, we discussed content-based filtering, which is a type of recommender system. We also briefly touched on collaborative filtering, another class of recommender systems. We saw that the content-based approach commonly employs two methods to make suggestions: cosine distance between user and item vectors, and the classification approach, both of which have their advantages and disadvantages.

Recommender systems are used by organizations and companies to automate the process of suggesting new content and products to their consumer base. They are widely used in the current e-commerce/online business environment. The next time you are suggested something online that you seem to like, you know exactly how it ended up on your feed!

Author
Turing Staff