Recommendation System Fundamentals and Data Preparation
What is a Recommendation System?
A recommendation system is an information-filtering technology that predicts user preferences across large item pools (products, movies, articles) and surfaces personalized suggestions. Widely cited industry estimates attribute about 35% of Amazon's revenue and 75% of Netflix's viewing activity to recommendations, which makes these systems critical for user engagement and business growth.
Three Core Recommendation System Types
1. Content-Based Filtering
Core Concept: Recommends items similar to those a user previously liked through feature matching.
Implementation Flow:
- Extract item features (e.g., movie genres, product categories)
- Create user preference profile from historical interactions
- Calculate cosine similarity between user profile and item features
- Return top-K most similar items
Business Value:
- ✅ No cold start for new items (they are scored from item features alone); new users still need a few interactions to build a profile
- ✅ High explainability ("Recommended because you watched...")
- ✅ Cost-effective for niche markets
- ❌ Limited cross-domain discovery
- ❌ Requires feature engineering
Vibe Coding Implementation:
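import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Feature extraction (assumes `movies` has a free-text 'description' column)
tfidf = TfidfVectorizer(stop_words='english')
movie_features = tfidf.fit_transform(movies['description'])
# User profile creation: mean TF-IDF vector of the movies the user liked
# (`liked_idx` is assumed to be a list of row positions in `movies`)
user_profile = np.asarray(movie_features[liked_idx].mean(axis=0))
# Similarity calculation
similarities = cosine_similarity(user_profile, movie_features).ravel()
similarities[liked_idx] = -1  # never recommend what the user has already seen
top_idx = similarities.argsort()[::-1][:5]  # five most similar, best first
recommended_movies = movies.iloc[top_idx]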
2. Collaborative Filtering
Core Concept: Leverages user-item interaction matrices to find patterns through collective behavior.
Matrix Factorization Approach:
- Create user-item rating matrix
- Decompose into latent user and item matrices (U, Σ, V)
- Predict ratings through matrix multiplication
- Recommend top predicted items
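The four steps above can be sketched with NumPy on a toy matrix. This is a minimal illustration under stated assumptions, not the method of any particular library: the ratings are invented, missing entries are filled with each user's mean rating (a simplification; practical systems usually fit only the observed entries), and `k = 2` latent factors is an arbitrary choice.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are invented for illustration.
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [1, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
], dtype=float)
rated = R > 0

# Step 1: fill missing entries with each user's mean rating, then center per user.
user_means = R.sum(axis=1) / rated.sum(axis=1)
filled = np.where(rated, R, user_means[:, None])
centered = filled - user_means[:, None]

# Step 2: decompose into U, sigma, V^T and keep the k strongest latent factors.
k = 2
U, sigma, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: predict ratings by multiplying the truncated factors back together.
predicted = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :] + user_means[:, None]

# Step 4: for each user, recommend the unrated item with the highest prediction.
scores = np.where(rated, -np.inf, predicted)
top_item = scores.argmax(axis=1)
print(top_item)
```

Masking already-rated items with `-np.inf` before taking the argmax keeps the recommendation restricted to items the user has not seen.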
Business Impact:
- ✅ Cross-domain discovery ("People who like X also like Y")
- ✅ Scalable with distributed computing
- ❌ Cold start problem for new users/items
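- ❌ Sparse rating matrices: an exact SVD of a dense m×n matrix costs O(min(m²n, mn²)) and cannot handle missing entries, so practical systems fit the factors with SGD or ALS on the observed ratings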
Vibe Coding Implementation:
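from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split
# Dataset preparation (MovieLens ratings range from 0.5 to 5.0)
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2)
# Model training
algo = SVD()
algo.fit(trainset)
# Prediction on the held-out ratings
predictions = algo.test(testset)
# Score every item for one user (`user_id` is assumed to be a raw userId)
# and keep the 10 highest estimated ratings
all_items = ratings['movieId'].unique()
scored = [algo.predict(user_id, i) for i in all_items]
recommendations = sorted(scored, key=lambda p: p.est, reverse=True)[:10]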
3. Hybrid Recommendation Systems
Core Concept: Combines content-based and collaborative filtering, either by blending their predictions or by feeding content features into the collaborative model.
Implementation Strategies:
- Weighted sum: prediction = α*content_pred + (1-α)*collab_pred
- Feature combination: Joint matrix factorization with content features
- Cascade approach: Content for cold start, collaborative for refinement
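The weighted-sum and cascade strategies can be sketched as two small functions. This is a hedged illustration: the function names, the `min_interactions` threshold, and the sample scores are hypothetical, and the sketch assumes both models score items on the same scale. The cascade here is a simplified reading of the bullet above, switching to content scores for users with too little history.

```python
import numpy as np

def weighted_hybrid(content_pred, collab_pred, alpha=0.5):
    """Weighted sum: alpha * content + (1 - alpha) * collaborative."""
    content_pred = np.asarray(content_pred, dtype=float)
    collab_pred = np.asarray(collab_pred, dtype=float)
    return alpha * content_pred + (1.0 - alpha) * collab_pred

def cascade_hybrid(content_pred, collab_pred, n_interactions, min_interactions=5):
    """Use content scores for cold-start users, collaborative scores otherwise."""
    if n_interactions < min_interactions:
        return np.asarray(content_pred, dtype=float)
    return np.asarray(collab_pred, dtype=float)

# Invented scores for three items from each model.
content = [0.9, 0.2, 0.4]
collab = [0.1, 0.8, 0.6]

blended = weighted_hybrid(content, collab, alpha=0.25)
cold_user = cascade_hybrid(content, collab, n_interactions=2)
warm_user = cascade_hybrid(content, collab, n_interactions=40)
print(blended, cold_user, warm_user)
```

In practice α is usually tuned on a validation set rather than fixed by hand.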
Business Advantage:
- ✅ Often higher precision@K than either method alone (the 20-40% range is indicative, not guaranteed; gains depend on the dataset and baseline)
- ✅ Mitigates cold start through content signals
- ✅ Enables personalized discovery
- ❌ Increased model complexity
- ❌ Requires careful parameter tuning
Vibe Coding Implementation:
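# Hybrid model using LightFM
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset
# Create hybrid dataset: register user IDs, item IDs, and item feature names
# (`users`, `items`, `feature_names` are assumed to be iterables of IDs/names)
hybrid_data = Dataset()
hybrid_data.fit(users, items, item_features=feature_names)
# `interaction_tuples` yields (user_id, item_id) pairs;
# `item_feature_tuples` yields (item_id, [feature_name, ...]) pairs
interactions, weights = hybrid_data.build_interactions(interaction_tuples)
item_features = hybrid_data.build_item_features(item_feature_tuples)
# Model training (WARP loss optimizes the ranking of positive interactions)
model = LightFM(loss='warp')
model.fit(interactions, item_features=item_features, epochs=20)
# Hybrid prediction: score every item for one user.
# LightFM works on internal indices, so map the raw user ID first.
user_map, _, item_map, _ = hybrid_data.mapping()
item_indices = np.arange(len(item_map), dtype=np.int32)
scores = model.predict(user_map[user_id], item_indices, item_features=item_features)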
Recommendation System Evaluation Metrics
| Metric | Definition | Business Impact | Implementation |
|---------------|-------------------------------------|------------------------------------------|----------------|
| Precision@K | Share of the top-K recommendations the user actually liked | Proxy for how often a shown recommendation is relevant | lightfm.evaluation.precision_at_k(model, test_interactions, k=10) |
| Recall@K | Share of the user's liked items that appear in the top-K | Measures discovery effectiveness | lightfm.evaluation.recall_at_k(model, test_interactions, k=10) |
| RMSE | Root mean squared error between predicted and actual ratings | Predictive accuracy for rating systems | np.sqrt(mean_squared_error(y_true, y_pred)) |
| MAP@K | Mean over users of average precision within the top-K | Rewards ranking relevant items near the top of the list | Custom helper (sklearn's average_precision_score has no K cutoff) |
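The metrics in the table can be written out directly for a single user's ranked list. This is a minimal sketch, not a library API: the function names and sample data are illustrative, and the AP@K normalization by min(number of relevant items, K) is one common convention among several. MAP@K is the mean of `average_precision_at_k` across users.

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Sum of precision@i at each relevant rank i <= k, normalized."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented example: a ranked list of item IDs and the items the user liked.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 40, 60}

p5 = precision_at_k(recommended, relevant, k=5)           # 2 hits out of 5
r5 = recall_at_k(recommended, relevant, k=5)              # 2 of 3 liked items found
ap5 = average_precision_at_k(recommended, relevant, k=5)  # (1/2 + 2/4) / 3
err = rmse([4.0, 3.0, 5.0], [3.0, 3.0, 4.0])
print(p5, r5, ap5, err)
```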
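Business ROI Analysis (illustrative figures, not benchmarks; actual impact varies by business and should be validated with A/B tests):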
- 1% improvement in precision@10 = $1.2M annual revenue for e-commerce
- 5% increase in recall@20 = 30% higher user engagement
- RMSE < 0.5 = 15% reduction in manual curation costs
MovieLens Dataset Preparation
Dataset Acquisition
import pandas as pd
import numpy as np
# Direct download from GroupLens
urls = {
'movies': 'https://files.grouplens.org/datasets/movielens/ml-latest-small/movies.csv',
'ratings': 'https://files.grouplens.org/datasets/movielens/ml-latest-small/ratings.csv'
}
movies = pd.read_csv(urls['movies'])
ratings = pd.read_csv(urls['ratings'])
print("=== Dataset Overview ===")
print(f"Movies: {len(movies):,} | Users: {ratings['userId'].nunique():,} | Ratings: {len(ratings):,}")
Feature Engineering
# Genre vectorization
movies['genres_list'] = movies['genres'].str.split('|')
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies['genres_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=movies['movieId'])
# Temporal feature extraction
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['timestamp'].dt.year
ratings['month'] = ratings['timestamp'].dt.month
Temporal Data Splitting
# Time-based split by user
train_data = []
test_data = []
for user_id, user_ratings in ratings.groupby('userId'):
    # Oldest 80% of each user's ratings go to training, newest 20% to testing
    user_ratings = user_ratings.sort_values('timestamp')
    split_idx = int(len(user_ratings) * 0.8)
    train_data.append(user_ratings.iloc[:split_idx])
    test_data.append(user_ratings.iloc[split_idx:])
train_df = pd.concat(train_data)
test_df = pd.concat(test_data)
print("=== Temporal Split ===")
print(f"Training: {len(train_df):,} ratings | Testing: {len(test_df):,} ratings")
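A quick way to confirm that a per-user temporal split behaves as intended is to check that no training rating is newer than the same user's earliest test rating. The sketch below repeats the split logic on a small invented DataFrame so it runs on its own; `temporal_split` and the toy data are illustrative names, not part of the MovieLens tooling.

```python
import pandas as pd

# Toy ratings, invented for illustration; timestamps are Unix seconds.
toy = pd.DataFrame({
    'userId':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'movieId':   [10, 11, 12, 13, 14, 10, 12, 15, 16, 17],
    'rating':    [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 4.5, 2.5, 5.0],
    'timestamp': [100, 300, 200, 500, 400, 50, 40, 30, 20, 10],
})

def temporal_split(df, train_frac=0.8):
    """Per user, put the earliest train_frac of ratings in train, the rest in test."""
    train_parts, test_parts = [], []
    for _, user_ratings in df.groupby('userId'):
        user_ratings = user_ratings.sort_values('timestamp')
        split_idx = int(len(user_ratings) * train_frac)
        train_parts.append(user_ratings.iloc[:split_idx])
        test_parts.append(user_ratings.iloc[split_idx:])
    return pd.concat(train_parts), pd.concat(test_parts)

toy_train, toy_test = temporal_split(toy)

# Leakage check: each user's latest training rating must not be later than
# that user's earliest test rating.
last_train = toy_train.groupby('userId')['timestamp'].max()
first_test = toy_test.groupby('userId')['timestamp'].min()
no_leakage = bool((last_train.reindex(first_test.index) <= first_test).all())
print(no_leakage)
```

This check is per user only. A training rating from one user can still be newer than another user's test ratings, so a single global cutoff date is the stricter alternative when that matters.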
Implementation Roadmap
Phase 1: Content-Based System
- Feature extraction pipeline
- User profile construction
- Similarity calculation engine
- Recommendation ranking
Phase 2: Collaborative Filtering
- Matrix factorization implementation
- Cold start mitigation strategies
- Scalability optimization
- Real-time prediction system
Phase 3: Hybrid System Development
- Model fusion architecture
- Weight optimization techniques
- A/B testing framework
- Production deployment
Transition to Advanced Topics
This chapter established the foundational concepts and data infrastructure for recommendation systems. In the next chapter, we'll implement collaborative filtering using matrix factorization techniques, focusing on the Singular Value Decomposition (SVD) algorithm. We'll explore how to transform user-item interactions into latent feature spaces, handle sparse matrices, and implement real-time prediction systems. This will enable us to build production-grade recommendation engines capable of processing millions of interactions while maintaining sub-100ms latency. The hybrid system development phase will then combine these approaches to create robust, scalable recommendation pipelines with measurable business impact.