Content-Based Recommendation System
Content-based recommendation systems are a fundamental approach in building intelligent recommendation engines. The core principle is straightforward yet powerful: calculate the similarity between items based on their features, and then recommend items that are most similar to those the user has previously enjoyed. This method is particularly effective when you have rich metadata about your items, such as genres for movies, categories for products, or topics for articles.
Understanding Similarity Calculation Methods
Before diving into implementation, it's crucial to understand the mathematical foundations behind similarity measurement. Two of the most commonly used methods are:
1. Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors, focusing on their orientation rather than magnitude. This makes it especially suitable for high-dimensional sparse data like text or categorical features.
The mathematical formula is:
$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{||A|| \times ||B||}$$
Where:
- $A \cdot B$ represents the dot product of vectors A and B
- $||A||$ and $||B||$ are the magnitudes (Euclidean norms) of vectors A and B
Key characteristics:
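- Range: -1 (opposite direction) to 1 (same direction); for non-negative features such as genre indicators, the range narrows to 0 to 1
- A value of 0 means the vectors are orthogonal, for example two movies that share no genre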
- Insensitive to vector length, making it ideal for text-based features where document length varies
- Computationally efficient for large datasets
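The formula above can be checked by hand on small vectors. The following is a minimal sketch, assuming hypothetical three-genre vectors and a helper named `cosine_sim` that is not part of the tutorial's pipeline; returning 0.0 for an all-zero vector is a convention chosen here, since the formula itself is undefined in that case.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention for this sketch; the ratio is undefined here
    return float(np.dot(a, b) / denom)

# Hypothetical genre vectors over [Action, Comedy, Drama]
movie_a = [1, 1, 0]
movie_b = [2, 2, 0]  # same direction as movie_a, twice the magnitude
movie_c = [0, 0, 1]  # shares no genre with movie_a

sim_ab = cosine_sim(movie_a, movie_b)
sim_ac = cosine_sim(movie_a, movie_c)
print(round(sim_ab, 4), round(sim_ac, 4))
```

Scaling `movie_a` by two leaves the score unchanged, which is the length-insensitivity described in the list above.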
2. Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear relationship between two variables. It quantifies how well the relationship between two variables can be described by a straight line.
The mathematical formula is:
$$\rho = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}$$
Where:
- $\text{cov}(X, Y)$ is the covariance between variables X and Y
- $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y
Key characteristics:
- Range: -1 to 1
- +1 indicates perfect positive correlation
- -1 indicates perfect negative correlation
- 0 indicates no linear correlation
- Particularly useful for comparing user rating patterns across different items
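Pearson correlation is the cosine similarity of the two vectors after each has had its mean subtracted, which is why it is robust to users who rate on systematically higher or lower scales. The sketch below illustrates this with made-up ratings; the `pearson` helper and the rating values are assumptions for illustration only, and the result is cross-checked against `np.corrcoef`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation computed as the cosine of the mean-centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    if denom == 0.0:
        return float("nan")  # undefined when either vector is constant
    return float(np.dot(xc, yc) / denom)

# Hypothetical ratings that two users gave to the same four items
user_a = [5.0, 4.0, 2.0, 1.0]
user_b = [4.0, 3.0, 1.0, 0.5]  # similar taste, but a stricter rating scale

r_ab = pearson(user_a, user_b)
r_np = np.corrcoef(user_a, user_b)[0, 1]  # NumPy's built-in computation
print(round(r_ab, 4), round(float(r_np), 4))
```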
Why Content-Based Recommendations Matter
Content-based filtering offers several compelling advantages from both technical and business perspectives:
For Users:
- Provides personalized experiences that grow with their preferences
- Offers transparency in recommendations (users can understand why items were recommended)
- Reduces popularity bias by ranking on item features rather than on what is popular (though it can over-specialize around what the user already likes)
For Businesses:
- Increases user engagement and time-on-platform
- Improves conversion rates through better product matching
- Requires less data infrastructure compared to collaborative filtering
- Mitigates the cold-start problem for new items, which can be recommended from their features before anyone has rated them
For Developers:
- More predictable behavior and easier debugging
- Simpler to implement and maintain
- Better suited for niche markets with specialized content
Building the Movie Feature Vector
Let's begin implementing our content-based recommendation system using the MovieLens dataset, a widely used benchmark in recommendation systems research.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer
# === 1. Loading and Preparing Movie Data ===
# Load the MovieLens dataset
movies = pd.read_csv('https://files.grouplens.org/datasets/movielens/ml-latest-small/movies.csv')
ratings = pd.read_csv('https://files.grouplens.org/datasets/movielens/ml-latest-small/ratings.csv')
# Display basic information about our datasets
print(f"Movies dataset shape: {movies.shape}")
print(f"Ratings dataset shape: {ratings.shape}")
print("\nMovies sample:")
print(movies.head())
print("\nRatings sample:")
print(ratings.head())
# Parse the genre information - genres are stored as pipe-separated strings
movies['genres_list'] = movies['genres'].str.split('|')
# Display parsed genres for verification
print("\nParsed genres for first 5 movies:")
for idx, row in movies.head().iterrows():
    print(f"{row['title']}: {row['genres_list']}")
One-Hot Encoding for Categorical Features
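To convert categorical genre data into numerical vectors, we'll use one-hot style encoding, which creates a binary column for each unique genre. Because a movie can belong to several genres, a row may contain more than one 1 (this variant is often called multi-hot encoding), which is why we use MultiLabelBinarizer.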
# Apply One-Hot Encoding to genre lists
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies['genres_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=movies['movieId'])
# Display the resulting feature matrix
print(f"Movie feature vector dimensions: {genre_df.shape}")
print("\nGenre feature matrix (first 10 rows):")
print(genre_df.head(10))
# Verify the feature names
print(f"\nUnique genres identified: {list(mlb.classes_)}")
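To see what the encoded matrix looks like without downloading the dataset, here is a small sketch on a hypothetical three-movie sample. It uses pandas' `Series.str.get_dummies`, which works directly on the pipe-separated strings; this is an alternative route to the same kind of binary matrix, not the code used in the rest of this walkthrough.

```python
import pandas as pd

# Hypothetical three-movie sample in the same pipe-separated format as movies.csv
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Animation|Children", "Comedy|Romance", "Adventure|Comedy"],
})

# One binary column per genre; a movie with several genres gets several 1s
toy_genre_df = toy_movies["genres"].str.get_dummies(sep="|")
toy_genre_df.index = toy_movies["movieId"]
print(toy_genre_df)
```

Each row is the feature vector for one movie, and the number of columns equals the number of distinct genres in the sample.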
Computing the Movie Similarity Matrix
With our feature vectors established, we can now calculate pairwise similarities between all movies in our dataset.
# Calculate cosine similarity between all movie pairs
movie_similarity = cosine_similarity(genre_df)
# Convert the similarity matrix to a DataFrame for easier manipulation
movie_similarity_df = pd.DataFrame(
    movie_similarity,
    index=genre_df.index,
    columns=genre_df.index
)
# Display matrix statistics
print(f"Similarity matrix size: {movie_similarity_df.shape}")
print(f"Similarity range: {movie_similarity_df.min().min():.4f} to {movie_similarity_df.max().max():.4f}")
# Show a sample of the similarity matrix
print("\nSample similarity matrix (first 5x5):")
print(movie_similarity_df.iloc[:5, :5])
Visualizing Similarity Patterns
Understanding the distribution of similarity scores helps us tune our recommendation parameters effectively.
import matplotlib.pyplot as plt
import seaborn as sns
# Create a heatmap visualization of the similarity matrix
plt.figure(figsize=(12, 10))
sns.heatmap(movie_similarity_df.iloc[:20, :20],
            cmap='viridis',
            annot=True,
            fmt='.2f',
            xticklabels=False,
            yticklabels=False)
plt.title('Movie Similarity Matrix Heatmap (First 20 Movies)')
plt.tight_layout()
plt.savefig('similarity_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
# Plot the distribution of similarity scores
plt.figure(figsize=(10, 6))
similarity_values = movie_similarity_df.values.flatten()
plt.hist(similarity_values, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Cosine Similarity Score')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Similarity Scores')
plt.axvline(x=0.5, color='red', linestyle='--', label='Threshold (0.5)')
plt.legend()
plt.tight_layout()
plt.savefig('similarity_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
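The histogram above flattens the full matrix, so every pair is counted twice and the diagonal of self-similarities (all 1.0) is included. When choosing a threshold, it can be more informative to look only at distinct pairs. The sketch below shows one way to do that with `np.triu_indices`, using a hypothetical 4x4 matrix as a stand-in for `movie_similarity_df.values`.

```python
import numpy as np

# Hypothetical symmetric similarity matrix standing in for movie_similarity_df.values
sim = np.array([
    [1.0, 0.8, 0.0, 0.5],
    [0.8, 1.0, 0.2, 0.0],
    [0.0, 0.2, 1.0, 0.6],
    [0.5, 0.0, 0.6, 1.0],
])

# Indices strictly above the diagonal: each unordered pair appears exactly once
rows, cols = np.triu_indices(sim.shape[0], k=1)
pair_scores = sim[rows, cols]

# Share of distinct pairs at or above the 0.5 threshold drawn in the histogram
share_above = float((pair_scores >= 0.5).mean())
print(len(pair_scores), share_above)
```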
Implementing Movie-to-Movie Recommendations
Now let's create a function that, given a specific movie ID, returns the most similar movies based on our computed similarity matrix.
def recommend_content_based(movie_id, n_recommendations=5):
    """
    Recommend the N most similar movies to a given movie ID.
    Parameters:
    -----------
    movie_id : int
        The ID of the movie to find similar movies for
    n_recommendations : int
        Number of similar movies to return (default: 5)
    Returns:
    --------
    list of dict
        List of recommended movies with their similarity scores
    """
    # Check if the movie exists in our dataset
    if movie_id not in movie_similarity_df.index:
        print(f"Warning: Movie ID {movie_id} not found in dataset")
        return []
    # Get similarity scores for the target movie
    similarity_scores = movie_similarity_df[movie_id]
    # Sort by similarity (descending) and remove the movie itself
    similar_movies = similarity_scores.sort_values(ascending=False)
    similar_movies = similar_movies.drop(movie_id)
    # Take top N recommendations
    top_movies = similar_movies.head(n_recommendations)
    # Compile results with movie information
    results = []
    for mid, score in top_movies.items():
        movie_info = movies[movies['movieId'] == mid].iloc[0]
        results.append({
            'movieId': int(mid),
            'title': movie_info['title'],
            'genres': movie_info['genres'],
            'similarity_score': round(score, 4)
        })
    return results
# Test with Toy Story (movieId = 1)
toy_story_id = 1
toy_story_recs = recommend_content_based(toy_story_id, n_recommendations=10)
print(f"=== Movies Similar to '{movies[movies['movieId'] == toy_story_id]['title'].values[0]}' ===")
for i, rec in enumerate(toy_story_recs, 1):
    print(f"{i}. {rec['title']:50s} | Similarity: {rec['similarity_score']:.4f} | Genres: {rec['genres']}")
User-Centric Content-Based Recommendations
The true power of content-based filtering emerges when we extend it to recommend movies for individual users based on their viewing history and ratings.
def recommend_for_user_content_based(user_id, n_recommendations=10):
    """
    Generate content-based recommendations for a specific user.
    Parameters:
    -----------
    user_id : int
        The ID of the user to generate recommendations for
    n_recommendations : int
        Number of movies to recommend (default: 10)
    Returns:
    --------
    list of dict
        List of recommended movies with their aggregate scores
    """
    # Get all movies rated by this user
    user_ratings = ratings[ratings['userId'] == user_id]
    # A set gives constant-time membership checks in the scoring loop below
    rated_movie_ids = set(user_ratings['movieId'])
    # Handle cold-start case: user has no ratings
    if not rated_movie_ids:
        print(f"Warning: User {user_id} has no ratings in the dataset")
        return []
    # Identify user's favorite movies (top 5 highest-rated)
    favorite_movies = user_ratings.sort_values('rating', ascending=False)
    top_rated = favorite_movies.head(5)
    print(f"\nUser {user_id}'s Top Rated Movies:")
    for _, row in top_rated.iterrows():
        # iterrows() upcasts the row to float, so cast the ID back to int
        movie_id = int(row['movieId'])
        movie_info = movies[movies['movieId'] == movie_id].iloc[0]
        print(f"  • {movie_info['title']} (Rating: {row['rating']:.1f}/5.0)")
    # Aggregate similarity scores from all favorite movies
    candidate_scores = {}
    for _, row in top_rated.iterrows():
        movie_id = int(row['movieId'])
        # Weight similarity by rating (higher ratings = higher weight)
        weight = row['rating'] / 5.0
        if movie_id in movie_similarity_df.index:
            # Get similarities to every other movie for this favorite
            similar = movie_similarity_df[movie_id].drop(movie_id)
            # Accumulate weighted scores for unseen movies
            for similar_id, score in similar.items():
                if similar_id not in rated_movie_ids:  # Exclude already rated movies
                    candidate_scores[similar_id] = candidate_scores.get(similar_id, 0) + score * weight
    # Sort candidates by aggregate score and take top N
    sorted_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    top_candidates = sorted_candidates[:n_recommendations]
    # Format results
    results = []
    for movie_id, score in top_candidates:
        movie_info = movies[movies['movieId'] == movie_id].iloc[0]
        results.append({
            'title': movie_info['title'],
            'genres': movie_info['genres'],
            'aggregate_score': round(score, 4)
        })
    return results
# Demonstrate with User 1
user_id = 1
recommendations = recommend_for_user_content_based(user_id)
print(f"\n=== Content-Based Recommendations for User {user_id} ===")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['title']:50s} | Score: {rec['aggregate_score']:.4f} | Genres: {rec['genres']}")
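The function above sums similarity scores from a handful of favorite movies. A common alternative, sketched here as an assumption rather than as the method used in this tutorial, is to build a single user profile vector (the rating-weighted average of the feature vectors of rated movies) and rank unseen movies by cosine similarity to that profile. The toy data and the name `recommend_from_profile` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical genre feature matrix indexed by movieId
genre_features = pd.DataFrame(
    [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]],
    index=[10, 20, 30, 40],
    columns=["Action", "Comedy", "Drama"],
)
# Hypothetical rating history for one user: movieId -> rating
user_history = pd.Series({10: 5.0, 30: 1.0})

def recommend_from_profile(features, history, n=2):
    """Rank unseen movies by cosine similarity to a rating-weighted profile."""
    rated = features.loc[history.index].to_numpy(dtype=float)
    weights = history.to_numpy(dtype=float)
    # Rating-weighted average of the rated movies' feature vectors
    profile = weights @ rated / weights.sum()
    # Only movies the user has not rated are candidates
    candidates = features.drop(index=history.index)
    mat = candidates.to_numpy(dtype=float)
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(profile)
    # Cosine similarity per candidate, with 0.0 where a norm is zero
    scores = np.divide(mat @ profile, norms, out=np.zeros(len(mat)), where=norms > 0)
    ranked = pd.Series(scores, index=candidates.index).sort_values(ascending=False)
    return ranked.head(n)

profile_recs = recommend_from_profile(genre_features, user_history)
print(profile_recs)
```

With a profile vector, scoring a user requires one matrix-vector product over the feature matrix instead of lookups into a full movie-by-movie similarity matrix, at the cost of blending all of a user's tastes into one vector.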
Advanced Feature Engineering
Our current implementation uses only genre information. Let's enhance it by incorporating additional movie features like release year and title keywords.
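# Extract the release year from the trailing "(YYYY)" in the title
# (anchoring avoids matching numbers inside the title itself)
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# Create additional features: the lowercased title without the year suffix
movies['title_words'] = movies['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True).str.lower()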
# Combine multiple features into a single text representation
movies['combined_features'] = movies['genres'] + ' ' + movies['year'].fillna('').astype(str) + ' ' + movies['title_words']
print("Enhanced feature combinations:")
print(movies[['title', 'combined_features']].head())
# Alternative approach: Use TF-IDF on combined features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf.fit_transform(movies['combined_features'])
# Compute similarity using TF-IDF features
tfidf_similarity = cosine_similarity(tfidf_matrix)
tfidf_similarity_df = pd.DataFrame(tfidf_similarity,
                                   index=movies['movieId'],
                                   columns=movies['movieId'])
print(f"\nTF-IDF similarity matrix shape: {tfidf_similarity_df.shape}")
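Year extraction from titles deserves a quick sanity check, because some titles contain a four-digit number of their own. The sketch below compares an unanchored pattern with one anchored to the trailing "(YYYY)" on a few illustrative titles; the sample list and variable names are assumptions for demonstration.

```python
import pandas as pd

# Illustrative titles, including ones with a four-digit number in the title itself
titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Blade Runner 2049 (2017) ",  # trailing space, as can occur in raw data
    "Untitled Project",           # no year at all
])

# Unanchored: grabs the first four-digit run anywhere in the title
naive_year = titles.str.extract(r'(\d{4})', expand=False)
# Anchored: only a parenthesized year at the end of the title
anchored_year = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(pd.DataFrame({"title": titles, "naive": naive_year, "anchored": anchored_year}))
```

Titles without a year yield a missing value in both cases, which the `fillna('')` call in the feature-combination step handles.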
Performance Optimization
For production systems, recomputing the similarity matrix on every start is wasteful. A simple first step is to compute it once and persist it to disk so it can be loaded on demand.
import pickle
import scipy.sparse as sp
# Store the matrix in CSR form; this saves space only when most entries are zero
sparse_similarity = sp.csr_matrix(movie_similarity_df.values)
# Save the similarity matrix to disk for faster loading
with open('movie_similarity_matrix.pkl', 'wb') as f:
    pickle.dump({
        'similarity_matrix': sparse_similarity,
        'movie_ids': list(movie_similarity_df.index),
        'genres': list(mlb.classes_)
    }, f)
print("Similarity matrix saved to 'movie_similarity_matrix.pkl'")
# Load and verify (only unpickle files from a source you trust)
with open('movie_similarity_matrix.pkl', 'rb') as f:
    loaded_data = pickle.load(f)
print(f"Loaded matrix shape: {loaded_data['similarity_matrix'].shape}")
print(f"Number of genres: {len(loaded_data['genres'])}")
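Persisting the full matrix still costs memory that grows with the square of the number of movies. One common way to shrink it, sketched here under the assumption that only the strongest neighbors are ever served, is to keep each movie's top-k most similar items. The helper name `top_k_neighbors` and the 4x4 matrix are hypothetical.

```python
import numpy as np

def top_k_neighbors(sim, k):
    """Return (indices, scores) of the k most similar items per row, excluding self."""
    sim = np.array(sim, dtype=float)  # work on a copy
    np.fill_diagonal(sim, -np.inf)    # never return an item as its own neighbor
    # Partial sort: the k largest scores per row, in arbitrary order
    idx = np.argpartition(-sim, kth=k - 1, axis=1)[:, :k]
    scores = np.take_along_axis(sim, idx, axis=1)
    # Order those k by descending score
    order = np.argsort(-scores, axis=1)
    return np.take_along_axis(idx, order, axis=1), np.take_along_axis(scores, order, axis=1)

# Hypothetical similarity matrix standing in for movie_similarity_df.values
sim = np.array([
    [1.0, 0.8, 0.0, 0.5],
    [0.8, 1.0, 0.2, 0.0],
    [0.0, 0.2, 1.0, 0.6],
    [0.5, 0.0, 0.6, 1.0],
])

neighbor_idx, neighbor_scores = top_k_neighbors(sim, k=2)
print(neighbor_idx)
print(neighbor_scores)
```

The result is two arrays of shape (number of movies, k) holding row positions and scores; mapping positions back to movie IDs would use the saved `movie_ids` list.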