Recommendation System Evaluation and Quality Analysis

Recommendation engines differ from many other machine-learning applications in that their success is not judged solely by prediction accuracy. In addition to precision, we must consider diversity, novelty, and coverage, all of which influence user satisfaction, engagement, and ultimately revenue. A system that only surfaces the most popular movies may score well on accuracy, yet users can quickly grow bored and churn.

1. Offline Evaluation Metrics

Offline evaluation uses historical data to estimate how a model would perform in production. It is fast, inexpensive, and allows rapid iteration, but it only sees interactions that were already logged, so it cannot tell us how users would respond to recommendations they were never shown. Below we detail the most common offline metrics, explain why they matter, and show step-by-step code that can be run in a Vibe Coding environment.

1.1 Precision@K and Recall@K

What – Precision@K measures the proportion of the top‑K recommended items that the user actually likes. Recall@K measures the proportion of all items the user likes that appear in the top‑K list.

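Why – High precision means users see relevant items, reducing friction and increasing click-through rates. High recall means the top-K list captures more of what the user actually likes. Recall alone does not prevent the “filter bubble” effect; that is what the diversity and novelty metrics in Sections 1.2 and 1.3 address. The two metrics also trade off against each other: a larger K tends to raise recall and lower precision.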

How – The following Python functions compute these metrics for a single user. They take the set of “liked” items as input; the evaluation loop later in this section treats a rating of 4 or higher as “liked”.

def precision_at_k(recommended_ids, relevant_ids, k):
    """
    Compute Precision@K.
    
    recommended_ids: List of item IDs returned by the model, sorted by score.
    relevant_ids: Set of item IDs the user actually liked.
    k: Number of top items to consider.
    """
    top_k = recommended_ids[:k]
    hits = len(set(top_k) & relevant_ids)
    return hits / k if k else 0.0


def recall_at_k(recommended_ids, relevant_ids, k):
    """
    Compute Recall@K.
    
    recommended_ids: List of item IDs returned by the model, sorted by score.
    relevant_ids: Set of item IDs the user actually liked.
    k: Number of top items to consider.
    """
    top_k = recommended_ids[:k]
    hits = len(set(top_k) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
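
As a quick sanity check of the two definitions, the sketch below applies them to a made-up ranked list. The item IDs and the combined helper `precision_recall_at_k` are illustrative assumptions, not part of the chapter's code.

```python
def precision_recall_at_k(recommended_ids, relevant_ids, k):
    """Return (Precision@K, Recall@K) for one user."""
    top_k = recommended_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Toy data (hypothetical IDs): the model ranked five items, the user liked three.
recommended = [10, 20, 30, 40, 50]
relevant = {20, 50, 70}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(f"K={k}: precision={p:.2f}, recall={r:.2f}")
# K=1: precision=0.00, recall=0.00
# K=3: precision=0.33, recall=0.33
# K=5: precision=0.40, recall=0.67
```

Item 70 never appears in the ranked list, so recall cannot reach 1.0 no matter how large K becomes.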

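Evaluation Loop – For each user in a test split, we generate recommendations, convert titles to IDs, and compute metrics across several K values. The function reads the `test_df` and `movies` DataFrames from the enclosing scope, so both must be loaded before it is called.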

import pandas as pd

def evaluate_user(user_id, model, n_recs=20):
    """
    Evaluate a single user’s recommendation quality.
    
    user_id: Identifier of the user to evaluate.
    model: Trained recommendation model with a `recommend` method.
    n_recs: Number of items to recommend.
    """
    # 1. Identify ground‑truth liked items from the test set
    user_test = test_df[test_df['userId'] == user_id]
    relevant = set(user_test[user_test['rating'] >= 4]['movieId'])
    if not relevant:
        return None
    
    # 2. Generate recommendations (model returns a list of dicts with 'title')
    recs = model.recommend(user_id, n_recs)
    
    # 3. Map titles back to movie IDs
    rec_ids = []
    for rec in recs:
        movie = movies[movies['title'] == rec['title']]
        if not movie.empty:
            rec_ids.append(movie.iloc[0]['movieId'])
    
    # 4. Compute metrics for multiple K values
    ks = [1, 3, 5, 10, 20]
    rows = []
    for k in ks:
        rows.append({
            'K': k,
            'Precision': precision_at_k(rec_ids, relevant, k),
            'Recall': recall_at_k(rec_ids, relevant, k)
        })
    return pd.DataFrame(rows)
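
Step 3 of `evaluate_user` scans the whole `movies` table once per recommended title. One possible alternative, sketched here under the assumption that titles are unique enough to serve as keys, is to build a title-to-ID dictionary once and reuse it. The helper names `build_title_index` and `titles_to_ids` are hypothetical. The sketch uses plain Python pairs so it runs on its own; with pandas, the pairs could come from `zip(movies['movieId'], movies['title'])`.

```python
def build_title_index(id_title_pairs):
    """Map each title to the first movie ID seen for it."""
    index = {}
    for movie_id, title in id_title_pairs:
        index.setdefault(title, movie_id)  # keep the first match, like .iloc[0]
    return index


def titles_to_ids(recs, title_index):
    """Convert recommendation dicts to IDs, skipping unknown titles."""
    return [title_index[r['title']] for r in recs if r['title'] in title_index]


# Toy catalog (hypothetical data)
catalog = [(1, 'Toy Story (1995)'), (2, 'Jumanji (1995)'), (3, 'Heat (1995)')]
title_index = build_title_index(catalog)

recs = [{'title': 'Heat (1995)'}, {'title': 'Unknown Film'}, {'title': 'Toy Story (1995)'}]
print(titles_to_ids(recs, title_index))  # [3, 1]
```

Skipping unknown titles shortens the list, so later items move up in rank. Whether that is acceptable depends on how often titles fail to match.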

Aggregating Results – Run the evaluation for a batch of users and compute the mean across users.

all_results = []
for uid in range(1, 21):          # Evaluate first 20 users
    df = evaluate_user(uid, hybrid_model)
    if df is not None:
        df['UserId'] = uid
        all_results.append(df)

if all_results:
    combined = pd.concat(all_results)
    avg = combined.groupby('K')[['Precision', 'Recall']].mean()
    print("\nAverage Offline Evaluation:")
    print(avg.to_string())

1.2 Diversity

What – Diversity measures how varied the genres (or any categorical attribute) of the recommended items are.

Why – A diverse recommendation list keeps users curious and reduces the risk of monotony. For content platforms, diversity can increase overall session length and cross‑sell opportunities.

How – We compute the ratio of unique genres in the recommendation list to the total number of genres in the catalog.

def diversity_score(recs):
    """
    Compute diversity as the proportion of distinct genres in the recommendation list.
    
    recs: List of recommendation dicts, each containing a 'genres' string like 'Action|Comedy'.
    """
    genre_set = set()
    for rec in recs:
        for g in rec['genres'].split('|'):
            genre_set.add(g)
    
    total_genres = len(movies['genres'].str.split('|').explode().unique())
    return len(genre_set) / total_genres if total_genres else 0.0

Comparison Example – Evaluate diversity for content‑based and hybrid recommendations.

cb_recs = content_based_model.recommend(1, 20)
hyb_recs = hybrid_model.recommend(1, 20)

print(f"Content‑Based Diversity: {diversity_score(cb_recs):.2%}")
print(f"Hybrid Diversity: {diversity_score(hyb_recs):.2%}")
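
To make the ratio concrete, here is a self-contained toy version of the same calculation. The catalog and recommendation lists are invented, and `genre_coverage_ratio` is a hypothetical name chosen to avoid clashing with `diversity_score`.

```python
def genre_coverage_ratio(rec_genres, catalog_genres):
    """Share of catalog genres that appear at least once in the recommendations."""
    catalog = set()
    for genres in catalog_genres:
        catalog.update(genres.split('|'))
    seen = set()
    for genres in rec_genres:
        seen.update(genres.split('|'))
    return len(seen & catalog) / len(catalog) if catalog else 0.0


# Hypothetical catalog with 5 distinct genres: Action, Comedy, Drama, Romance, Horror
catalog_genres = ['Action|Comedy', 'Drama', 'Comedy|Romance', 'Horror']
narrow_list = ['Action|Comedy', 'Comedy']          # covers 2 of 5 genres
broad_list = ['Action|Comedy', 'Drama', 'Horror']  # covers 4 of 5 genres

print(genre_coverage_ratio(narrow_list, catalog_genres))  # 0.4
print(genre_coverage_ratio(broad_list, catalog_genres))   # 0.8
```

This ratio only records whether a genre is present. A list with nineteen action films and one drama scores the same as an evenly balanced list covering those two genres.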

1.3 Novelty

What – Novelty quantifies how many recommended items are not popular.

Why – Recommending only blockbuster titles can quickly exhaust a user’s interest. Novelty encourages discovery of hidden gems, which can increase user loyalty and reduce churn.

How – We define novelty as the proportion of recommended items whose popularity score falls below a threshold. Popularity is measured by the number of ratings relative to the most popular item.

def novelty_score(recs, threshold=0.2):
    """
    Compute novelty as the fraction of recommended items below a popularity threshold.
    
    recs: List of recommendation dicts, each containing a 'title'.
    threshold: Popularity ratio below which an item is considered novel.
    """
    if not recs:
        return 0.0
    
    # Popularity = number of ratings per movie, relative to the most-rated movie
    pop_counts = ratings.groupby('movieId').size()
    max_count = pop_counts.max()
    
    novel_count = 0
    matched = 0
    for rec in recs:
        match = movies[movies['title'] == rec['title']]
        if match.empty:
            continue  # skip titles missing from the catalog instead of raising IndexError
        matched += 1
        movie_id = match.iloc[0]['movieId']
        pop_ratio = pop_counts.get(movie_id, 0) / max_count
        if pop_ratio < threshold:
            novel_count += 1
    return novel_count / matched if matched else 0.0

Comparison Example –

print(f"Content‑Based Novelty: {novelty_score(cb_recs):.2%}")
print(f"Hybrid Novelty: {novelty_score(hyb_recs):.2%}")
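
The popularity ratio and threshold can be illustrated with a few invented rating counts. This sketch uses `collections.Counter` in place of the pandas `groupby`. The data and the name `novelty_fraction` are assumptions for illustration only.

```python
from collections import Counter


def novelty_fraction(rec_ids, rating_movie_ids, threshold=0.2):
    """Fraction of recommended IDs whose rating count is below threshold * max count."""
    counts = Counter(rating_movie_ids)
    if not rec_ids or not counts:
        return 0.0
    max_count = max(counts.values())
    novel = sum(1 for mid in rec_ids if counts.get(mid, 0) / max_count < threshold)
    return novel / len(rec_ids)


# Hypothetical rating log: movie 1 has 10 ratings, movie 2 has 3, movie 3 has 1
rating_movie_ids = [1] * 10 + [2] * 3 + [3]

# Ratios: movie 1 -> 1.0, movie 2 -> 0.3, movie 3 -> 0.1, movie 4 (unrated) -> 0.0
print(novelty_fraction([1, 2, 3, 4], rating_movie_ids))  # 0.5
```

Because the ratio is taken against the single most-rated item, one extremely popular title can push most of the catalog below the threshold. It is worth checking the distribution of ratios before fixing the threshold.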

1.4 Coverage

What – Coverage measures the proportion of the entire catalog that the system can recommend.

Why – High coverage indicates that the system draws on a large share of the catalog, so niche and long-tail items get exposure instead of the same small set of popular titles being shown to everyone.

How – We aggregate the set of unique items recommended across a sample of users and divide by the total catalog size.

def coverage_score(user_ids, model, n_recs=10):
    """
    Compute coverage across a set of users.
    
    user_ids: Iterable of user identifiers.
    model: Recommendation model with a `recommend` method.
    n_recs: Number of items to recommend per user.
    """
    all_recs = set()
    for uid in user_ids:
        recs = model.recommend(uid, n_recs)
        for rec in recs:
            match = movies[movies['title'] == rec['title']]
            if not match.empty:  # skip titles missing from the catalog
                all_recs.add(match.iloc[0]['movieId'])
    return len(all_recs) / len(movies) if len(movies) else 0.0

Usage –

# Compute coverage for the first 50 users
cov = coverage_score(range(1, 51), hybrid_model)
print(f"Hybrid Coverage: {cov:.2%}")

2. Building an Evaluation Dashboard

A visual dashboard helps stakeholders quickly compare algorithms and track key metrics over time. Below is a minimal example using Matplotlib and Seaborn that plots Precision@5, Diversity, and Novelty for three methods.

import matplotlib.pyplot as plt
import seaborn as sns

methods = ['Content‑Based', 'User‑Based CF', 'Hybrid']
metrics = {
    'Precision@5': [],
    'Diversity': [],
    'Novelty': []
}

# Helper to convert a recommendation list to movie IDs (unmatched titles are skipped)
def rec_ids_from_titles(recs):
    ids = []
    for r in recs:
        match = movies[movies['title'] == r['title']]
        if not match.empty:
            ids.append(match.iloc[0]['movieId'])
    return ids

# Ground truth for user 1 comes from the held-out test split, as in Section 1.1
relevant = set(test_df[(test_df['userId'] == 1) & (test_df['rating'] >= 4)]['movieId'])

# Evaluate each method for a single user (e.g., user 1)
for func in [
    lambda uid: content_based_model.recommend(uid, 20),
    lambda uid: user_based_cf.recommend(uid, 20),
    lambda uid: hybrid_model.recommend(uid, 20)
]:
    recs = func(1)
    ids = rec_ids_from_titles(recs)
    
    metrics['Precision@5'].append(
        precision_at_k(ids, relevant, 5)
    )
    metrics['Diversity'].append(diversity_score(recs))
    metrics['Novelty'].append(novelty_score(recs))

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, metric in enumerate(['Precision@5', 'Diversity', 'Novelty']):
    sns.barplot(x=methods, y=metrics[metric], ax=axes[i], palette='viridis')
    axes[i].set_title(metric)
    axes[i].set_ylim(0, 1)
    for j, val in enumerate(metrics[metric]):
        axes[i].text(j, val + 0.02, f'{val:.2%}', ha='center')
plt.tight_layout()
plt.show()

Business Insight – A chart like this lets a product manager weigh, for example, a more diverse model against a higher-precision one according to the target KPI (e.g., click-through vs. session length). Because the example scores a single user, the metrics should be averaged over many users before any conclusion is drawn.

3. Online Evaluation: A/B Testing Framework

Offline metrics are necessary but not sufficient. The ultimate test is how users react in real time. A/B testing allows us to compare two or more recommendation algorithms under identical traffic conditions.

3.1 A/B Test Design Blueprint

What – Define the experiment, variants, traffic allocation, metrics, and duration.

Why – A/B tests provide statistically valid evidence of which algorithm drives better business outcomes (e.g., higher conversion rate or longer session).

How – The following configuration is a template that can be deployed in a Vibe Coding environment or integrated into a production feature flag system.

ab_test_config = {
    'experiment_name': 'Recommendation Algorithm A/B Test',
    'variants': [
        {
            'name': 'Control (Content‑Based)',
            'algorithm': 'content_based',
            'traffic': 0.5  # 50% of users
        },
        {
            'name': 'Experiment (Hybrid)',
            'algorithm': 'hybrid',
            'traffic': 0.5  # 50% of users
        }
    ],
    'metrics': [
        'click_through_rate',
        'conversion_rate',
        'average_session_duration',
        'diversity_click_rate'
    ],
    'minimum_sample_size': 1000,
    'duration_days': 14
}

print("=== A/B Test Design ===")
for variant in ab_test_config['variants']:
    print(f"{variant['name']}: {variant['traffic']*100:.0f}% traffic")
print(f"\nMetrics: {', '.join(ab_test_config['metrics'])}")
print(f"Duration: {ab_test_config['duration_days']} days")
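
The `minimum_sample_size` of 1000 in the template is a placeholder. A common way to choose a value is a power calculation for comparing two proportions. The sketch below uses the standard normal-approximation formula with only the Python standard library. The 5% baseline CTR and the 6% target are made-up inputs, and `required_sample_size` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def required_sample_size(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect p_control vs p_treatment (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 5% baseline CTR, hoping to detect an increase to 6%
n = required_sample_size(0.05, 0.06)
print(f"Users needed per variant: {n}")
```

Under these assumed inputs the result is on the order of eight thousand users per variant, well above the placeholder. Smaller expected effects raise the requirement quickly because the effect size is squared in the denominator.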

3.2 Implementing the Experiment

Feature Flagging – Use a lightweight flag service (e.g., LaunchDarkly, Optimizely, or a custom Redis key) to assign users to variants deterministically.
Metric Collection – Instrument the recommendation endpoint to log impressions, clicks, and conversions to a time‑series database (e.g., InfluxDB, TimescaleDB).
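Statistical Analysis – After the experiment, compute lift and confidence intervals. A two-sample t-test suits continuous metrics such as session duration; rate metrics such as CTR are usually compared with a two-proportion z-test or a Bayesian A/B testing library.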
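
Deterministic assignment is often done by hashing the user ID together with the experiment name, so the same user always lands in the same variant without any stored state. The following is a minimal sketch of that idea, not the API of any particular flag service. `assign_variant` and the key format are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment_name, variants):
    """Deterministically map a user to a variant according to traffic shares.

    variants: list of dicts with 'name' and 'traffic' (shares summing to 1.0).
    """
    key = f"{experiment_name}:{user_id}".encode('utf-8')
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for variant in variants:
        cumulative += variant['traffic']
        if bucket < cumulative:
            return variant['name']
    return variants[-1]['name']  # guard against floating-point rounding


variants = [
    {'name': 'Control', 'traffic': 0.5},
    {'name': 'Experiment', 'traffic': 0.5},
]

counts = {'Control': 0, 'Experiment': 0}
for uid in range(10_000):
    counts[assign_variant(uid, 'rec_algo_test', variants)] += 1
print(counts)  # roughly even split; exact numbers depend on the hash
```

Including the experiment name in the hashed key means a user's bucket in one experiment does not determine their bucket in the next.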

Sample Python Snippet (Post‑Experiment Analysis)

import pandas as pd
from scipy import stats

# Load logged metrics (assumes columns: variant, clicks, impressions, session_duration,
# with variants labeled 'Control' and 'Experiment' in the log)
df = pd.read_csv('ab_test_logs.csv')
control = df[df['variant'] == 'Control']
experiment = df[df['variant'] == 'Experiment']

# Compute lift for CTR
ctr_control = control['clicks'].sum() / control['impressions'].sum()
ctr_experiment = experiment['clicks'].sum() / experiment['impressions'].sum()
lift = (ctr_experiment - ctr_control) / ctr_control

print(f"CTR Lift: {lift:.2%}")

# Welch's two-sample t-test (equal_var=False) on session duration
t_stat, p_val = stats.ttest_ind(
    control['session_duration'],
    experiment['session_duration'],
    equal_var=False
)
print(f"Session Duration p‑value: {p_val:.4f}")
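
The snippet above reports CTR lift without a significance test. Because CTR is a proportion, one common choice is a pooled two-proportion z-test. The sketch below implements it with the standard library only. The click and impression totals are invented, `two_proportion_z_test` is a hypothetical helper, and treating impressions as independent trials is an approximation when one user generates many impressions.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share the same CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical totals: control 500 clicks / 10,000 impressions,
# experiment 580 clicks / 10,000 impressions
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

With real logs, the four arguments would be the summed `clicks` and `impressions` columns for each variant.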

4. Leveraging Vibe Coding for Evaluation

Vibe Coding is a low‑code platform that lets data scientists and developers prototype, test, and deploy recommendation pipelines with minimal boilerplate. Below is a “chant” (prompt) you can paste into Vibe Coding’s AI assistant to trigger a full evaluation script.

🔥 Recommendation Evaluation Prompt
`Please perform a comprehensive evaluation of my recommendation system:

Compute Precision@1, @3, @5, @10.
Compute Recall@5, @10.
Calculate the genre diversity score.
Calculate the novelty score (ratio of non-popular items).
Generate a radar chart comparing Content-Based, User-Based CF, and Hybrid methods.
Output a detailed evaluation report.`

When you run this prompt, Vibe Coding will automatically:

Load your dataset and trained models.
Execute the metric functions defined above.
Render visualizations inline.
Export a PDF report that can be shared with stakeholders.

5. Summary of Key Takeaways

In this chapter you learned to:

Measure Precision@K / Recall@K – the standard relevance metrics for top-K lists.
Quantify Diversity – ensuring a mix of genres or categories.
Assess Novelty – promoting discovery of less‑known items.
Calculate Coverage – checking how much of the catalog the system actually reaches.
Design and Run A/B Tests – validating hypotheses with real users.
Integrate Vibe Coding – accelerating experimentation and reporting.

These metrics and practices form the backbone of a data‑driven recommendation strategy that balances business goals (e.g., revenue, retention) with user experience.

6. Transition to the Next Chapter

Having established a rigorous evaluation framework, the next logical step is to expose your recommendation engine as a scalable, production‑ready API. In the upcoming chapter, we will walk through packaging the model into a RESTful service, deploying it on a cloud platform with auto‑scaling, and setting up continuous monitoring and retraining pipelines. By the end of that module, you will have a fully operational recommendation service that can be integrated into any web or mobile application, ready to deliver personalized content to millions of users while continuously improving through automated evaluation loops.