Recommendation System Fundamentals and Data Preparation

What is a Recommendation System?

A recommendation system is an information filtering technology that predicts user preferences across large item pools (products, movies, articles) and surfaces personalized suggestions. Widely cited industry estimates credit these systems with roughly 35% of Amazon's revenue and 75% of Netflix's viewing activity, which is why they are treated as critical for user engagement and business growth.

Three Core Recommendation System Types

1. Content-Based Filtering

Core Concept: Recommends items similar to those a user previously liked through feature matching.

Implementation Flow:

  1. Extract item features (e.g., movie genres, product categories)
  2. Create user preference profile from historical interactions
  3. Calculate cosine similarity between user profile and item features
  4. Return top-K most similar items

Business Value:

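  • ✅ No cold start for new items (recommendations rely on item features, not interaction history); new users still need a few interactions to build a profile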
  • ✅ High explainability ("Recommended because you watched...")
  • ✅ Cost-effective for niche markets
  • ❌ Limited cross-domain discovery
  • ❌ Requires feature engineering

Vibe Coding Implementation:

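import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Feature extraction: one TF-IDF row per movie
# (assumes `movies` has a free-text column named 'description')
tfidf = TfidfVectorizer(stop_words='english')
movie_features = tfidf.fit_transform(movies['description'])

# User profile: mean TF-IDF vector of the movies the user liked
# (`liked_idx` holds row positions in `movies`, e.g. movies rated >= 4)
user_profile = np.asarray(movie_features[liked_idx].mean(axis=0))

# Similarity between the profile and every movie
similarities = cosine_similarity(user_profile, movie_features).ravel()
similarities[liked_idx] = -1  # exclude movies the user has already liked

# Top 5 most similar movies, best first
top_idx = similarities.argsort()[::-1][:5]
recommended_movies = movies.iloc[top_idx]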

2. Collaborative Filtering

Core Concept: Leverages user-item interaction matrices to find patterns through collective behavior.

Matrix Factorization Approach:

  1. Create user-item rating matrix
  2. Decompose into latent user and item matrices (U, Σ, V)
  3. Predict ratings through matrix multiplication
  4. Recommend top predicted items
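
The four steps above can be sketched on a toy matrix with plain NumPy. This is only an illustration under simplifying assumptions: the matrix `R`, the helper names, and the choice of `k=2` are made up for the example, and unrated cells are naively treated as zeros, whereas library implementations typically fit only the observed ratings.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 marks "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def truncated_svd_scores(R, k):
    """Return the rank-k reconstruction U_k @ diag(sigma_k) @ Vt_k of R."""
    U, sigma, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

def top_n_unrated(R, scores, user, n=1):
    """Rank the items the user has not rated by predicted score, best first."""
    unrated = np.where(R[user] == 0)[0]
    ranked = unrated[np.argsort(scores[user, unrated])[::-1]]
    return ranked[:n].tolist()

scores = truncated_svd_scores(R, k=2)   # predicted score for every user-item pair
print(top_n_unrated(R, scores, user=0))
```

Keeping only the largest `k` singular values is what forces the model to generalize: with `k` equal to the full rank, the reconstruction simply reproduces `R`, zeros included.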

Business Impact:

  • ✅ Cross-domain discovery ("People who like X also like Y")
  • ✅ Scalable with distributed computing
  • ❌ Cold start problem for new users/items
  • ❌ Expensive to factorize exactly (a full SVD of an n×n matrix is roughly O(n³)), so practical systems use iterative solvers such as SGD or ALS

Vibe Coding Implementation:

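from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split

# Dataset preparation (MovieLens ratings range from 0.5 to 5.0)
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2)

# Model training
algo = SVD()
algo.fit(trainset)

# Evaluate on the held-out ratings
predictions = algo.test(testset)

# Score every movie for one user and keep the 10 highest estimates
# (`user_id` is a raw userId; filter out already-rated movies as needed)
all_items = ratings['movieId'].unique()
scored = (algo.predict(user_id, i) for i in all_items)
recommendations = sorted(scored, key=lambda p: p.est, reverse=True)[:10]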

3. Hybrid Recommendation Systems

Core Concept: Combines content-based and collaborative filtering, most simply through weighted fusion of their scores.

Implementation Strategies:

  1. Weighted sum: prediction = α*content_pred + (1-α)*collab_pred
  2. Feature combination: Joint matrix factorization with content features
  3. Cascade approach: Content for cold start, collaborative for refinement
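
A minimal sketch of the weighted-sum and cascade strategies follows. It assumes each model has already produced one score per candidate item. The function names, the min-max rescaling step, and the `min_ratings=5` threshold are illustrative choices, not part of any particular library.

```python
import numpy as np

def min_max_scale(scores):
    """Rescale scores to [0, 1] so both models are on a comparable scale."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def weighted_hybrid(content_scores, collab_scores, alpha=0.5):
    """prediction = alpha * content_pred + (1 - alpha) * collab_pred"""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (alpha * min_max_scale(content_scores)
            + (1 - alpha) * min_max_scale(collab_scores))

def cascade_hybrid(content_scores, collab_scores, n_user_ratings, min_ratings=5):
    """Fall back to content scores for cold-start users, else use collaborative scores."""
    chosen = content_scores if n_user_ratings < min_ratings else collab_scores
    return min_max_scale(chosen)

content_scores = [0.2, 0.9, 0.4]   # e.g. cosine similarities
collab_scores = [4.5, 3.0, 5.0]    # e.g. predicted ratings
blended = weighted_hybrid(content_scores, collab_scores, alpha=0.5)
print(blended.argsort()[::-1])     # item positions, best first
```

Rescaling matters because cosine similarities and predicted ratings live on different scales; without it, the collaborative term would dominate regardless of α. In practice α is usually tuned on a validation set.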

Business Advantage:

  • ✅ Often higher precision@K than either method alone (improvements of 20-40% are claimed in some settings; results vary by dataset and baseline)
  • ✅ Mitigates cold start through content signals
  • ✅ Enables personalized discovery
  • ❌ Increased model complexity
  • ❌ Requires careful parameter tuning

Vibe Coding Implementation:

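# Hybrid model using LightFM
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Register users, items, and item feature names (here: genres)
genres = movies['genres'].str.split('|')
hybrid_data = Dataset()
hybrid_data.fit(users=ratings['userId'].unique(),
                items=movies['movieId'],
                item_features=sorted({g for gs in genres for g in gs}))

# Build the sparse interaction and item-feature matrices
interactions, weights = hybrid_data.build_interactions(
    zip(ratings['userId'], ratings['movieId']))
item_features = hybrid_data.build_item_features(zip(movies['movieId'], genres))

# Model training
model = LightFM(loss='warp')
model.fit(interactions, item_features=item_features, epochs=20)

# Hybrid prediction: score every item for one user
# (LightFM works on internal indices, so map the raw `user_id` first)
user_map, _, item_map, _ = hybrid_data.mapping()
scores = model.predict(user_map[user_id], np.arange(len(item_map)),
                       item_features=item_features)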

Recommendation System Evaluation Metrics

| Metric | Definition | Business Impact | Implementation |
|-------------|-------------------------------------|------------------------------------------|----------------|
| Precision@K | Share of the top-K recommendations the user actually liked | Tends to track conversion rates | precision_at_k(predictions, k=10) |
| Recall@K | Share of the user's liked items that appear in the top-K | Measures discovery effectiveness | recall_at_k(predictions, k=10) |
| RMSE | Root Mean Squared Error between predicted and actual ratings | Predictive accuracy for rating systems | np.sqrt(mean_squared_error(y_true, y_pred)) |
| MAP@K | Mean over users of the average precision within the top-K | Rewards ranking relevant items near the top | mean of per-user average precision |
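
The metrics in the table can be sketched in a few lines of plain Python. The function names and signatures here are illustrative (ranked item list plus a set of relevant items), not those of a specific library, and the average-precision normalizer `min(len(relevant), k)` is one common convention among several.

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Average of precision@rank over the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

recommended = [10, 20, 30, 40, 50]   # ranked recommendations for one user
relevant = {20, 50, 60}              # items the user actually liked
print(precision_at_k(recommended, relevant, k=5))          # 2 hits out of 5
print(recall_at_k(recommended, relevant, k=5))             # 2 of 3 liked items found
print(average_precision_at_k(recommended, relevant, k=5))
```

MAP@K is then the mean of `average_precision_at_k` across all evaluated users.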

Business ROI Analysis (illustrative figures only; actual impact depends on traffic, margins, and baseline quality):

  • 1% improvement in precision@10 = $1.2M annual revenue for e-commerce
  • 5% increase in recall@20 = 30% higher user engagement
  • RMSE < 0.5 = 15% reduction in manual curation costs

MovieLens Dataset Preparation

Dataset Acquisition

import pandas as pd
import numpy as np

# Direct download from GroupLens
urls = {
    'movies': 'https://files.grouplens.org/datasets/movielens/ml-latest-small/movies.csv',
    'ratings': 'https://files.grouplens.org/datasets/movielens/ml-latest-small/ratings.csv'
}

movies = pd.read_csv(urls['movies'])
ratings = pd.read_csv(urls['ratings'])

print("=== Dataset Overview ===")
print(f"Movies: {len(movies):,} | Users: {ratings['userId'].nunique():,} | Ratings: {len(ratings):,}")

Feature Engineering

from sklearn.preprocessing import MultiLabelBinarizer

# Genre vectorization: one binary column per genre, indexed by movieId
movies['genres_list'] = movies['genres'].str.split('|')
mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(movies['genres_list'])
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_, index=movies['movieId'])

# Temporal feature extraction
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['timestamp'].dt.year
ratings['month'] = ratings['timestamp'].dt.month

Temporal Data Splitting

# Time-based split by user
train_data = []
test_data = []

for user_id, user_ratings in ratings.groupby('userId'):
    user_ratings = user_ratings.sort_values('timestamp')
    split_idx = int(len(user_ratings) * 0.8)
    train_data.append(user_ratings.iloc[:split_idx])
    test_data.append(user_ratings.iloc[split_idx:])

train_df = pd.concat(train_data)
test_df = pd.concat(test_data)

print("=== Temporal Split ===")
print(f"Training: {len(train_df):,} ratings | Testing: {len(test_df):,} ratings")
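
As a sanity check on the split above, the sketch below re-implements the same per-user 80/20 rule in vectorized form on a small made-up table and verifies that no user's training rating is later than their earliest test rating. The helper names and the toy data are assumptions for illustration; the column names follow the MovieLens `ratings` frame.

```python
import pandas as pd

def temporal_split(ratings, train_frac=0.8):
    """Per user, put the earliest train_frac of ratings in train and the rest in test."""
    ordered = ratings.sort_values(['userId', 'timestamp'])
    position = ordered.groupby('userId').cumcount()
    size = ordered.groupby('userId')['userId'].transform('size')
    is_train = position < (size * train_frac).astype(int)
    return ordered[is_train], ordered[~is_train]

def has_no_leakage(train, test):
    """True if every user's latest train timestamp is <= their earliest test timestamp."""
    last_train = train.groupby('userId')['timestamp'].max()
    first_test = test.groupby('userId')['timestamp'].min()
    joined = pd.concat([last_train, first_test], axis=1,
                       keys=['last_train', 'first_test']).dropna()
    return bool((joined['last_train'] <= joined['first_test']).all())

# Toy ratings table, deliberately out of chronological order
toy = pd.DataFrame({
    'userId':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'movieId':   [10, 11, 12, 13, 14, 10, 12, 15, 16, 17],
    'rating':    [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0, 2.5],
    'timestamp': [5, 1, 3, 2, 4, 100, 300, 200, 500, 400],
})

toy_train, toy_test = temporal_split(toy)
print(len(toy_train), len(toy_test), has_no_leakage(toy_train, toy_test))
```

Note that a per-user split only prevents leakage within each user's history. One user's training ratings can still be newer than another user's test ratings, so a single global cutoff date may be preferable when simulating production conditions.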

Implementation Roadmap

Phase 1: Content-Based System

  1. Feature extraction pipeline
  2. User profile construction
  3. Similarity calculation engine
  4. Recommendation ranking

Phase 2: Collaborative Filtering

  1. Matrix factorization implementation
  2. Cold start mitigation strategies
  3. Scalability optimization
  4. Real-time prediction system

Phase 3: Hybrid System Development

  1. Model fusion architecture
  2. Weight optimization techniques
  3. A/B testing framework
  4. Production deployment

Transition to Advanced Topics

This chapter established the foundational concepts and data infrastructure for recommendation systems. In the next chapter, we'll implement collaborative filtering using matrix factorization techniques, focusing on the Singular Value Decomposition (SVD) algorithm. We'll explore how to transform user-item interactions into latent feature spaces, handle sparse matrices, and implement real-time prediction systems. This will enable us to build production-grade recommendation engines capable of processing millions of interactions while maintaining sub-100ms latency. The hybrid system development phase will then combine these approaches to create robust, scalable recommendation pipelines with measurable business impact.
