Content-Based Recommendation System

Content-based recommendation is one of the basic approaches to building recommendation engines. The core principle is simple: compute the similarity between items from their features, then recommend the items most similar to those the user has already enjoyed. The method works best when items carry rich metadata, such as genres for movies, categories for products, or topics for articles.

Understanding Similarity Calculation Methods

Before diving into implementation, it's crucial to understand the mathematical foundations behind similarity measurement. Two of the most commonly used methods are:

1. Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, focusing on their orientation rather than magnitude. This makes it especially suitable for high-dimensional sparse data like text or categorical features.

The mathematical formula is:

$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{||A|| \times ||B||}$$

Where:

$A \cdot B$ represents the dot product of vectors A and B
$||A||$ and $||B||$ are the magnitudes (Euclidean norms) of vectors A and B

Key characteristics:

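Range: -1 (opposite direction) to 1 (same direction); for non-negative features such as binary genre vectors, the range narrows to 0 to 1
A value of 0 means the vectors are orthogonal, i.e. they share no features
Insensitive to vector magnitude, making it ideal for text-based features where document length varies
Cheap to compute on sparse data, since only non-zero entries contribute to the dot product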
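
To make the formula concrete, here is a minimal NumPy sketch. The helper `cosine_sim` and the four genre vectors are hypothetical illustrations, not part of the MovieLens data, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # An all-zero vector has no direction; returning 0.0 is a convention (assumption)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical genre vectors over [Action, Comedy, Drama, Sci-Fi]
movie_a = [1, 0, 0, 1]
movie_b = [1, 0, 0, 1]
movie_c = [0, 1, 1, 0]
movie_d = [1, 1, 0, 1]

sim_ab = cosine_sim(movie_a, movie_b)   # identical genres, close to 1
sim_ac = cosine_sim(movie_a, movie_c)   # no shared genres, exactly 0
sim_ad = cosine_sim(movie_a, movie_d)   # 2 / (sqrt(2) * sqrt(3)), about 0.8165

# Scaling a vector changes its magnitude but not its direction
sim_scaled = cosine_sim(movie_a, 3 * np.array(movie_d))

print(sim_ab, sim_ac, round(sim_ad, 4), round(sim_scaled, 4))
```

The last comparison shows the insensitivity to magnitude: multiplying `movie_d` by 3 leaves the score unchanged.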

2. Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two variables. It quantifies how well the relationship between two variables can be described by a straight line.

The mathematical formula is:

$$\rho = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}$$

Where:

$\text{cov}(X, Y)$ is the covariance between variables X and Y
$\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y

Key characteristics:

Range: -1 to 1
+1 indicates perfect positive correlation
-1 indicates perfect negative correlation
0 indicates no linear correlation
Particularly useful for comparing user rating patterns across different items
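
As a rough illustration, Pearson correlation can be computed as cosine similarity on mean-centered vectors. The sketch below uses hypothetical ratings from three users on the same five movies; the helper name `pearson` is an assumption for this example, and `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation, computed as cosine similarity of mean-centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    # A constant vector has zero variance; returning 0.0 is a convention (assumption)
    return float(xc @ yc / denom) if denom else 0.0

# Hypothetical ratings given by three users to the same five movies
user_a = [5.0, 4.0, 3.0, 2.0, 1.0]
user_b = [4.5, 3.5, 2.5, 1.5, 0.5]   # same taste as user_a, but rates half a star lower
user_c = [1.0, 2.0, 3.0, 4.0, 5.0]   # opposite taste

r_ab = pearson(user_a, user_b)        # close to +1
r_ac = pearson(user_a, user_c)        # close to -1

# Cosine on the raw ratings does not see the opposition, because all ratings are positive
a, c = np.array(user_a), np.array(user_c)
raw_cos_ac = float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

print(round(r_ab, 4), round(r_ac, 4), round(raw_cos_ac, 4))
```

Centering removes each user's rating offset, which is why Pearson treats `user_a` and `user_b` as perfectly correlated even though their raw ratings differ.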

Why Content-Based Recommendations Matter

Content-based filtering offers several compelling advantages from both technical and business perspectives:

For Users:

Provides personalized experiences that grow with their preferences
Offers transparency in recommendations (users can understand why items were recommended)
Reduces popularity bias by focusing on item features rather than on what is popular (though recommendations can become over-specialized toward items similar to those already seen)

For Businesses:

Can increase user engagement and time on platform
Can improve conversion rates through better product matching
Needs only item metadata and the user's own history, not interaction data from other users
Mitigates the cold-start problem for new items, which can be recommended as soon as their features are known

For Developers:

More predictable behavior and easier debugging
Simpler to implement and maintain
Better suited for niche markets with specialized content

Building the Movie Feature Vector

Let's begin implementing our content-based recommendation system using the MovieLens dataset, a widely used benchmark in recommendation systems research.

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

# === 1. Loading and Preparing Movie Data ===

# Load the MovieLens dataset
movies = pd.read_csv('https://files.grouplens.org/datasets/movielens/ml-latest-small/movies.csv')
ratings = pd.read_csv('https://files.grouplens.org/datasets/movielens/ml-latest-small/ratings.csv')

# Display basic information about our datasets
print(f"Movies dataset shape: {movies.shape}")
print(f"Ratings dataset shape: {ratings.shape}")
print("\nMovies sample:")
print(movies.head())
print("\nRatings sample:")
print(ratings.head())

# Parse the genre information - genres are stored as pipe-separated strings
movies['genres_list'] = movies['genres'].str.split('|')

# Display parsed genres for verification
print("\nParsed genres for first 5 movies:")
for idx, row in movies.head().iterrows():
    print(f"{row['title']}: {row['genres_list']}")

One-Hot Encoding for Categorical Features

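To convert categorical genre data into numerical vectors, we'll use one-hot encoding, which creates a binary column for each unique genre. Because a movie can belong to several genres, a row may contain more than one 1 (this variant is sometimes called multi-hot encoding); scikit-learn's MultiLabelBinarizer handles this case.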

# Apply One-Hot Encoding to genre lists
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies['genres_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=movies['movieId'])

# Display the resulting feature matrix
print(f"Movie feature vector dimensions: {genre_df.shape}")
print("\nGenre feature matrix (first 10 rows):")
print(genre_df.head(10))

# Verify the feature names
print(f"\nUnique genres identified: {list(mlb.classes_)}")
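
For a quick look at what the encoding produces, here is a small sketch on three hypothetical rows in the same pipe-separated format. It uses pandas' `Series.str.get_dummies` as an alternative route to the same kind of binary matrix; this is an illustration, not a replacement for the `MultiLabelBinarizer` step above.

```python
import pandas as pd

# Three hypothetical movies using the same pipe-separated genre format
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
}).set_index("movieId")

# One binary column per genre; a row can contain several 1s
toy_genre_df = toy_movies["genres"].str.get_dummies(sep="|")
print(toy_genre_df)
```

Each row is the feature vector for one movie, and the column order is the sorted list of genres found in the data.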

Computing the Movie Similarity Matrix

With our feature vectors established, we can now calculate pairwise similarities between all movies in our dataset.

# Calculate cosine similarity between all movie pairs
movie_similarity = cosine_similarity(genre_df)

# Convert the similarity matrix to a DataFrame for easier manipulation
movie_similarity_df = pd.DataFrame(
    movie_similarity,
    index=genre_df.index,
    columns=genre_df.index
)

# Display matrix statistics
print(f"Similarity matrix size: {movie_similarity_df.shape}")
print(f"Similarity range: {movie_similarity_df.min().min():.4f} to {movie_similarity_df.max().max():.4f}")

# Show a sample of the similarity matrix
print("\nSample similarity matrix (first 5x5):")
print(movie_similarity_df.iloc[:5, :5])
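
Conceptually, pairwise cosine similarity amounts to scaling every row to unit length and taking a matrix product. The NumPy sketch below shows this on a hypothetical three-movie feature matrix; it illustrates the idea and is not a description of scikit-learn's internals.

```python
import numpy as np

# Hypothetical binary genre matrix: 3 movies x 6 genres
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# Scale each row to unit length (guarding against all-zero rows)
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0
X_unit = X / norms

# Entry (i, j) of the product is the cosine similarity between movies i and j
S = X_unit @ X_unit.T
print(np.round(S, 4))
```

Movies 0 and 1 share two of their three genres, so their score is 2/3; movie 2 shares nothing with either, so its off-diagonal scores are 0.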

Visualizing Similarity Patterns

Understanding the distribution of similarity scores helps us choose sensible cut-offs, such as a minimum similarity threshold for recommendations.

import matplotlib.pyplot as plt
import seaborn as sns

# Create a heatmap visualization of the similarity matrix
plt.figure(figsize=(12, 10))
sns.heatmap(movie_similarity_df.iloc[:20, :20], 
            cmap='viridis', 
            annot=True, 
            fmt='.2f',
            xticklabels=False,
            yticklabels=False)
plt.title('Movie Similarity Matrix Heatmap (First 20 Movies)')
plt.tight_layout()
plt.savefig('similarity_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

# Plot the distribution of similarity scores
# (flattening the full matrix includes the diagonal of 1.0s and counts each pair twice)
plt.figure(figsize=(10, 6))
similarity_values = movie_similarity_df.values.flatten()
plt.hist(similarity_values, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Cosine Similarity Score')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Similarity Scores')
plt.axvline(x=0.5, color='red', linestyle='--', label='Threshold (0.5)')
plt.legend()
plt.tight_layout()
plt.savefig('similarity_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
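
A similarity matrix is symmetric with a diagonal of self-similarities, so a histogram over every cell counts each pair twice and adds a spike at 1.0. One way to look only at distinct pairs is to take the upper triangle, sketched here on a small hypothetical matrix:

```python
import numpy as np

# Small hypothetical similarity matrix (symmetric, diagonal = 1.0)
sim = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

# k=1 skips the diagonal, so each unordered pair appears exactly once
rows, cols = np.triu_indices(sim.shape[0], k=1)
unique_pairs = sim[rows, cols]

# Share of distinct pairs at or above a candidate threshold
share_above = float((unique_pairs >= 0.5).mean())
print(unique_pairs, round(share_above, 4))
```

For n movies this yields n(n-1)/2 values, which is also less data to pass to the histogram.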

Implementing Movie-to-Movie Recommendations

Now let's create a function that, given a specific movie ID, returns the most similar movies based on our computed similarity matrix.

def recommend_content_based(movie_id, n_recommendations=5):
    """
    Recommend the N most similar movies to a given movie ID.
    
    Parameters:
    -----------
    movie_id : int
        The ID of the movie to find similar movies for
    n_recommendations : int
        Number of similar movies to return (default: 5)
    
    Returns:
    --------
    list of dict
        List of recommended movies with their similarity scores
    """
    # Check if the movie exists in our dataset
    if movie_id not in movie_similarity_df.index:
        print(f"Warning: Movie ID {movie_id} not found in dataset")
        return []
    
    # Get similarity scores for the target movie
    similarity_scores = movie_similarity_df[movie_id]
    
    # Sort by similarity (descending) and remove the movie itself
    similar_movies = similarity_scores.sort_values(ascending=False)
    similar_movies = similar_movies.drop(movie_id)
    
    # Take top N recommendations
    top_movies = similar_movies.head(n_recommendations)
    
    # Compile results with movie information
    results = []
    for mid, score in top_movies.items():
        movie_info = movies[movies['movieId'] == mid].iloc[0]
        results.append({
            'movieId': int(mid),
            'title': movie_info['title'],
            'genres': movie_info['genres'],
            'similarity_score': round(score, 4)
        })
    
    return results

# Test with Toy Story (movieId = 1)
toy_story_id = 1
toy_story_recs = recommend_content_based(toy_story_id, n_recommendations=10)

print(f"=== Movies Similar to '{movies[movies['movieId'] == toy_story_id]['title'].values[0]}' ===")
for i, rec in enumerate(toy_story_recs, 1):
    print(f"{i}. {rec['title']:50s} | Similarity: {rec['similarity_score']:.4f} | Genres: {rec['genres']}")
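
Sorting every score just to keep the top few does more work than needed. As a sketch of one possible alternative, `np.argpartition` can select the N largest entries first and sort only those. The function name `top_n_similar` and the sample row are hypothetical, and the function works on array positions rather than movie IDs.

```python
import numpy as np

def top_n_similar(sim_row, self_pos, n):
    """Positions of the n largest scores in sim_row, excluding self_pos, best first."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[self_pos] = -np.inf              # never recommend the movie itself
    n = min(n, scores.size - 1)
    if n <= 0:
        return np.array([], dtype=int)
    # Select the n largest scores without fully sorting the row
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only the selected candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

# Hypothetical similarity row for the movie at position 0
row = np.array([1.0, 0.2, 0.9, 0.4, 0.7])
top = top_n_similar(row, self_pos=0, n=3)
print(top)
```

With genre-only features many movies tie at the same score, so the order among tied candidates is arbitrary in either approach.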

User-Centric Content-Based Recommendations

The true power of content-based filtering emerges when we extend it to recommend movies for individual users based on their viewing history and ratings.

def recommend_for_user_content_based(user_id, n_recommendations=10):
    """
    Generate content-based recommendations for a specific user.
    
    Parameters:
    -----------
    user_id : int
        The ID of the user to generate recommendations for
    n_recommendations : int
        Number of movies to recommend (default: 10)
    
    Returns:
    --------
    list of dict
        List of recommended movies with their aggregate scores
    """
    # Get all movies rated by this user
    user_ratings = ratings[ratings['userId'] == user_id]
    # Use a set so the "already rated" check below is O(1) per lookup
    rated_movie_ids = set(user_ratings['movieId'])
    
    # Handle cold-start case: user has no ratings
    if len(rated_movie_ids) == 0:
        print(f"Warning: User {user_id} has no ratings in the dataset")
        return []
    
    # Identify user's favorite movies (top 5 highest-rated)
    favorite_movies = user_ratings.sort_values('rating', ascending=False)
    top_rated = favorite_movies.head(5)
    
    print(f"\nUser {user_id}'s Top Rated Movies:")
    for _, row in top_rated.iterrows():
        movie_info = movies[movies['movieId'] == row['movieId']].iloc[0]
        print(f"  • {movie_info['title']} (Rating: {row['rating']:.1f}/5.0)")
    
    # Aggregate similarity scores from all favorite movies
    candidate_scores = {}
    
    for _, row in top_rated.iterrows():
        # iterrows() upcasts mixed int/float rows to float, so cast the ID back to int
        movie_id = int(row['movieId'])
        # Weight similarity by rating (higher ratings = higher weight)
        weight = row['rating'] / 5.0
        
        if movie_id in movie_similarity_df.index:
            # Get similar movies for this favorite
            similar = movie_similarity_df[movie_id].sort_values(ascending=False)
            similar = similar.drop(movie_id)
            
            # Accumulate weighted scores for unseen movies
            for similar_id, score in similar.items():
                if similar_id not in rated_movie_ids:  # Exclude already rated movies
                    candidate_scores[similar_id] = candidate_scores.get(similar_id, 0) + score * weight
    
    # Sort candidates by aggregate score and take top N
    sorted_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    top_candidates = sorted_candidates[:n_recommendations]
    
    # Format results
    results = []
    for movie_id, score in top_candidates:
        movie_info = movies[movies['movieId'] == movie_id].iloc[0]
        results.append({
            'title': movie_info['title'],
            'genres': movie_info['genres'],
            'aggregate_score': round(score, 4)
        })
    
    return results

# Demonstrate with User 1
user_id = 1
recommendations = recommend_for_user_content_based(user_id)

print(f"\n=== Content-Based Recommendations for User {user_id} ===")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['title']:50s} | Score: {rec['aggregate_score']:.4f} | Genres: {rec['genres']}")
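
A common alternative to summing per-movie similarities is to build a single user profile vector first and compare it with every movie. The sketch below is that alternative, not the method used above: the feature matrix, the ratings, and the name `user_profile_scores` are hypothetical, and it reuses the same rating / 5.0 weighting.

```python
import numpy as np

# Hypothetical binary genre matrix: rows = movies, columns = [Action, Comedy, Drama, Sci-Fi]
item_features = np.array([
    [1, 0, 0, 1],   # movie 0
    [0, 1, 0, 0],   # movie 1
    [1, 0, 0, 0],   # movie 2
    [0, 1, 1, 0],   # movie 3
    [1, 0, 0, 1],   # movie 4
], dtype=float)

# Hypothetical ratings: movie position -> rating on a 0.5 to 5.0 scale
user_ratings = {0: 5.0, 1: 2.0}

def user_profile_scores(item_features, user_ratings):
    """Score every movie by cosine similarity to a rating-weighted user profile."""
    rated = np.array(list(user_ratings.keys()))
    weights = np.array(list(user_ratings.values())) / 5.0
    # Profile = weighted sum of the feature vectors of the rated movies
    profile = weights @ item_features[rated]
    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
    scores = np.divide(item_features @ profile, norms,
                       out=np.zeros(len(item_features)), where=norms > 0)
    scores[rated] = -np.inf                 # exclude movies the user already rated
    return scores

scores = user_profile_scores(item_features, user_ratings)
ranking = np.argsort(scores)[::-1][:3]
print(ranking, np.round(scores[ranking], 4))
```

This needs only one matrix-vector product per user, at the cost of blending all rated movies into one average taste.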

Advanced Feature Engineering

Our current implementation uses only genre information. Let's enhance it by incorporating additional movie features like release year and title keywords.

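# Extract the release year from the trailing "(YYYY)" in the title
# (anchoring on the parentheses avoids matching digits inside titles such as "2001: A Space Odyssey (1968)")
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)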

# Create additional features
movies['title_words'] = movies['title'].str.replace(r'\s*\(\d{4}\)', '', regex=True).str.lower()

# Combine multiple features into a single text representation
movies['combined_features'] = movies['genres'] + ' ' + movies['year'].fillna('').astype(str) + ' ' + movies['title_words']

print("Enhanced feature combinations:")
print(movies[['title', 'combined_features']].head())

# Alternative approach: Use TF-IDF on combined features
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = tfidf.fit_transform(movies['combined_features'])

# Compute similarity using TF-IDF features
tfidf_similarity = cosine_similarity(tfidf_matrix)
tfidf_similarity_df = pd.DataFrame(tfidf_similarity, 
                                   index=movies['movieId'], 
                                   columns=movies['movieId'])

print(f"\nTF-IDF similarity matrix shape: {tfidf_similarity_df.shape}")
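
One detail worth checking in the year extraction step is which four digits the pattern actually captures. The sketch below compares an unanchored pattern with one anchored to a trailing "(YYYY)" on a few hypothetical titles; the last title is deliberately missing a year.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Blade Runner 2049 (2017)",
    "Untitled Project",
])

# Unanchored: takes the first run of four digits anywhere in the title
first_four_digits = titles.str.extract(r"(\d{4})", expand=False)

# Anchored: takes only a four-digit year in parentheses at the end of the title
trailing_year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

print(pd.DataFrame({"title": titles,
                    "unanchored": first_four_digits,
                    "anchored": trailing_year}))
```

Titles without a matching year come back as missing values, which is why the combined feature string fills them with an empty string before concatenation.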

Performance Optimization

For production systems, recomputing the full similarity matrix on every start is wasteful. A simple first step is to compute it once and cache it on disk.

import pickle
import scipy.sparse as sp

# Save the similarity matrix to disk for faster loading
# (stored as a sparse matrix: movies that share no genres have similarity 0)
sparse_similarity = sp.csr_matrix(movie_similarity_df.values)
with open('movie_similarity_matrix.pkl', 'wb') as f:
    pickle.dump({
        'similarity_matrix': sparse_similarity,
        'movie_ids': list(movie_similarity_df.index),
        'genres': list(mlb.classes_)
    }, f)

print("Similarity matrix saved to 'movie_similarity_matrix.pkl'")

# Load and verify
with open('movie_similarity_matrix.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(f"Loaded matrix shape: {loaded_data['similarity_matrix'].shape}")
print(f"Number of genres: {len(loaded_data['genres'])}")
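
Storage can be reduced further by keeping only each movie's K nearest neighbours instead of the full matrix, so that n movies need n × K entries rather than n × n. The following is a minimal sketch under that assumption; the function name `top_k_neighbours` and the 4 × 4 matrix are hypothetical.

```python
import numpy as np

def top_k_neighbours(sim, k):
    """For each row, return positions and scores of the k most similar other items."""
    sim = np.asarray(sim, dtype=float).copy()
    np.fill_diagonal(sim, -np.inf)          # an item is not its own neighbour
    k = min(k, sim.shape[0] - 1)
    # Sort each row by descending score and keep the first k positions
    idx = np.argsort(-sim, axis=1)[:, :k]
    scores = np.take_along_axis(sim, idx, axis=1)
    return idx, scores

# Small hypothetical similarity matrix
sim = np.array([
    [1.0, 0.8, 0.0, 0.3],
    [0.8, 1.0, 0.5, 0.1],
    [0.0, 0.5, 1.0, 0.9],
    [0.3, 0.1, 0.9, 1.0],
])

neighbour_idx, neighbour_scores = top_k_neighbours(sim, k=2)
print(neighbour_idx)
print(neighbour_scores)
```

The two returned arrays can be saved in place of the full matrix; the trade-off is that similarities outside each movie's top K are no longer available at lookup time.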
