Classification: Customer Churn Prediction

Why Classification?

In the previous chapter, we predicted continuous values (house prices). But a huge class of real-world problems requires predicting categories rather than numbers:

Is this transaction fraudulent? → Yes / No
Will this customer churn next month? → Churn / No Churn
Is the tumor in this MRI scan benign or malignant? → Benign / Malignant

These are all classification problems. Unlike regression, which answers "how much?", classification answers "which category?"

Why this matters for your career:

Classification is one of the most common ML tasks in business: fraud detection, churn prediction, medical diagnosis, and credit scoring are all classification problems
Classification requires different evaluation metrics (precision, recall, F1), and mastering them separates serious ML engineers from beginners
Real-world datasets are rarely balanced, so learning how to handle imbalance is a critical skill
Classification models are used across industries: finance, healthcare, e-commerce, cybersecurity

Business Scenario: E-Commerce Customer Churn

Imagine you run a subscription-based e-commerce platform. Every month, some customers cancel. If you could predict which customers are about to churn, you could automatically send them a discount coupon or a care message before they leave, reducing your churn rate.

What Is Churn Prediction?

Churn prediction answers: "Given this customer's behavior history (tenure, spending, complaints, contract status), what is the probability they will cancel next month?"

| Feature | Description | Why It Matters |
|:--------|:------------|:---------------|
| Tenure | Months since signup | Longer tenure = more loyal |
| Monthly charges | Amount billed per month | Higher charges = higher churn risk |
| Support tickets | Number of customer service visits | More tickets = frustration |
| Contract status | Whether customer has a long-term contract | Contract = locked in |
| Complaints | Number of formal complaints | Complaints = immediate red flag |

Loading the Customer Dataset

We will use a simulated e-commerce dataset with 1,000 customers:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate simulated customer data
np.random.seed(42)
n_customers = 1000

df = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n_customers),        # Months of tenure
    'monthly_charges': np.random.uniform(30, 120, n_customers),    # Monthly charge
    'total_charges': np.random.uniform(100, 8000, n_customers),    # Total charges
    'num_support_tickets': np.random.randint(0, 10, n_customers),  # Support tickets
    'has_contract': np.random.choice([0, 1], n_customers),         # Has contract
    'avg_order_value': np.random.uniform(10, 200, n_customers),    # Avg order value
    'num_complaints': np.random.randint(0, 5, n_customers),        # Complaints
})

# Simulate churn labels (higher tenure + contract = less likely to churn)
churn_prob = (
    0.4
    - 0.005 * df['tenure_months']
    + 0.003 * df['num_support_tickets']
    + 0.05 * df['num_complaints']
    - 0.15 * df['has_contract']
    + np.random.normal(0, 0.1, n_customers)
)
df['churn'] = (churn_prob > 0.5).astype(int)

print(f"Dataset size: {df.shape}")
print(f"Churn rate: {df['churn'].mean()*100:.1f}%")

Logistic Regression

Do not be fooled by the name! Despite having "Regression" in it, Logistic Regression is a classification algorithm.

How It Works

The idea: calculate a linear combination of the features, $z = w_1x_1 + w_2x_2 + \cdots + b$ (just as in linear regression), then pass $z$ through the sigmoid function to squash the output into a probability between 0 and 1.

$$P(y=1) = \frac{1}{1 + e^{-(w_1x_1 + w_2x_2 + \cdots + b)}}$$

| Input | Calculation | Output |
|:------|:------------|:-------|
| z = large positive (e.g. 5) | Sigmoid(5) = 0.993 | 99% probability of churn |
| z = 0 | Sigmoid(0) = 0.500 | Exactly 50/50 (uncertain) |
| z = large negative (e.g. -5) | Sigmoid(-5) = 0.007 | 1% probability of churn |

Decision Rule

If probability $\geq 0.5$, predict Churn (1)
If probability $< 0.5$, predict No Churn (0)
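
The sigmoid and the decision rule fit in a few lines of plain Python. The sketch below is only an illustration: the weights, bias, and customer values are made-up numbers, not parameters learned from the chapter's dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z: float, threshold: float = 0.5) -> int:
    """Decision rule: 1 (Churn) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical weights and bias for two features, chosen only for illustration
weights = [0.8, -0.05]   # e.g. num_complaints, tenure_months
bias = -1.0
customer = [3, 6]        # 3 complaints, 6 months of tenure

# Linear combination z = w1*x1 + w2*x2 + b
z = sum(w * x for w, x in zip(weights, customer)) + bias

for value in (5, 0, -5, z):
    print(f"z = {value:+.2f} -> P(churn) = {sigmoid(value):.3f} -> label {predict_label(value)}")
```

The first three rows reproduce the table above; the last row shows the same calculation for one hypothetical customer.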

Training the Logistic Regression Model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split data
X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

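# Scale features: fit the scaler on the training set only, then reuse it on the test set.
# Scaling is not strictly required for Logistic Regression, but it helps the solver
# converge and makes the learned coefficients comparable across features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)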

# Train model
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

# Predict
y_pred = log_model.predict(X_test_scaled)
y_pred_proba = log_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
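
The probabilities returned by `predict_proba` are what the decision rule is applied to, so they can also be thresholded by hand. The following is a minimal sketch on synthetic stand-in data from `make_classification` (an assumption, not the chapter's dataset); the 0.3 threshold is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for churn data: roughly 10% positives (an assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # P(class 1) for each test row

# For a binary model, predict() corresponds to a 0.5 cut on this probability
# (up to exact ties at the boundary)
default_pred = (proba >= 0.5).astype(int)
print("matches predict():", np.array_equal(default_pred, model.predict(X_te)))

# A lower (hypothetical) threshold flags at least as many customers,
# so recall can only stay the same or go up
summary = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    summary[threshold] = (int(pred.sum()), recall_score(y_te, pred))
    print(f"threshold={threshold}: flagged={summary[threshold][0]}, "
          f"recall={summary[threshold][1]:.2f}")
```

Lowering the threshold trades more false alarms for fewer missed churners; which side to favor depends on what each kind of mistake costs the business.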

Classification Evaluation Metrics

Classification evaluation is more complex than regression. You cannot just look at accuracy — you need to understand where the model makes mistakes and what kind of mistakes it makes.

The Confusion Matrix

The Confusion Matrix is the foundation of classification evaluation. It shows four outcomes:

| | Actual Positive | Actual Negative |
|:---|:--------------:|:--------------:|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

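# Manual interpretation
# scikit-learn puts actual classes in rows and predicted classes in columns,
# with labels sorted (0 = No Churn, 1 = Churn), so ravel() yields TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()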
print(f"True Negatives (TN): {TN}  — Correctly predicted no churn")
print(f"False Positives (FP): {FP}  — Falsely flagged as churn (wasted coupon)")
print(f"False Negatives (FN): {FN}  — Missed churn (most severe!)")
print(f"True Positives (TP): {TP}  — Correctly predicted churn")
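
A tiny hand-checkable example makes the layout of scikit-learn's matrix easier to trust. The ten labels below are hypothetical and chosen so the four counts can be verified by eye.

```python
from sklearn.metrics import confusion_matrix

# Ten hypothetical customers: 1 = churn, 0 = no churn
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_hat)
print(cm_demo)
# Rows are actual classes, columns are predicted classes, labels sorted as [0, 1]:
# [[TN FP]
#  [FN TP]]

tn, fp, fn, tp = cm_demo.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=2, FN=1, TP=3
```

Note that this array is transposed and reordered relative to the table above, which lists predictions in rows and the positive class first.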

Precision vs. Recall

Precision: Of all customers predicted to churn, how many actually churned?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Low precision = too many false alarms = wasted coupon costs

Recall: Of all customers who actually churned, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN}$$

Low recall = too many missed churners = lost customers

F1 Score: The harmonic mean of Precision and Recall, one number that balances both

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))
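
To see that the three formulas are what the report computes, here is a small sketch that derives them by hand on hypothetical toy labels and compares the results with scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy labels: 1 = churn, 0 = no churn
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Count the outcomes that involve the positive (churn) class
tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)  # 2
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)  # 1

precision = tp / (tp + fp)                            # 3 / 5 = 0.60
recall = tp / (tp + fn)                               # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)    # 0.9 / 1.35, about 0.667

print(f"By hand: precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print(f"sklearn: precision={precision_score(y_true, y_hat):.3f}, "
      f"recall={recall_score(y_true, y_hat):.3f}, f1={f1_score(y_true, y_hat):.3f}")
```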

Why Accuracy Can Be Misleading

If only 5% of customers churn, a model that predicts "No Churn" for everyone achieves 95% accuracy — but is completely useless! It never catches anyone who is about to leave.

When data is imbalanced (few churners), focus on Recall and Precision, not Accuracy. This is one of the most important lessons in machine learning.
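
The 95% figure is easy to reproduce. This sketch builds 1,000 hypothetical labels with exactly 5% churners and scores a "model" that always predicts No Churn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 hypothetical customers, exactly 5% of whom churn
y_actual = np.array([1] * 50 + [0] * 950)

# A "model" that predicts No Churn (0) for everyone
y_always_no = np.zeros_like(y_actual)

acc = accuracy_score(y_actual, y_always_no)
rec = recall_score(y_actual, y_always_no)

print(f"Accuracy: {acc:.2f}")  # 0.95
print(f"Recall:   {rec:.2f}")  # 0.00
```

Accuracy rewards the model for the 950 easy negatives, while recall exposes that none of the 50 churners were caught.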

Decision Tree

A Decision Tree is another highly intuitive classification algorithm. It works like an "if...then" quiz game in which each question splits the data into smaller groups.

Has Contract?
├── No: Frequent Complaints?
│   ├── Yes: Predict Churn ❌
│   └── No: Tenure > 12 months?
│       ├── Yes: Predict No Churn ✅
│       └── No: Predict Churn ❌
└── Yes: Predict No Churn ✅
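
Read top to bottom, the diagram is just nested if/else logic. The function below is a hand-written transcription of that example tree (the function and parameter names are illustrative, and nothing here is learned from data).

```python
def predict_churn(has_contract: bool, frequent_complaints: bool, tenure_months: int) -> str:
    """Hand-written version of the example tree above (not a trained model)."""
    if has_contract:
        return "No Churn"      # contract customers exit at the first question
    if frequent_complaints:
        return "Churn"         # no contract and frequent complaints
    if tenure_months > 12:
        return "No Churn"      # no contract, few complaints, long tenure
    return "Churn"             # no contract, few complaints, short tenure

print(predict_churn(has_contract=False, frequent_complaints=False, tenure_months=24))  # No Churn
print(predict_churn(has_contract=False, frequent_complaints=True, tenure_months=24))   # Churn
```

A trained tree has the same shape; the difference is that the algorithm chooses the questions and cut-off values from the training data.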

Why Decision Trees are popular:

Completely transparent — you can read and explain every decision
No feature scaling needed — works directly with raw data
Handles both numeric and categorical data in principle (scikit-learn's implementation expects categorical features to be encoded as numbers first)
Built-in feature importance tells you what matters most

from sklearn.tree import DecisionTreeClassifier

# Decision Tree does not require scaling
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['No Churn', 'Churn']))

max_depth=5 caps the tree at five levels of splits, which reduces the risk of overfitting.
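
To get a feel for what the depth limit does, it helps to compare an unrestricted tree with a capped one. The sketch below uses noisy synthetic data from `make_classification` (an assumption, not the chapter's dataset), so the exact numbers will differ from what you see on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (flip_y adds label noise); not the chapter's dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=3, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for depth in (None, 5):   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = {
        "depth": tree.get_depth(),
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }
    print(f"max_depth={depth}: actual depth={results[depth]['depth']}, "
          f"train={results[depth]['train_acc']:.3f}, test={results[depth]['test_acc']:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the typical signature of overfitting; the capped tree usually narrows that gap.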

Feature Importance Analysis

One of the most powerful features of Decision Trees (and tree-based models in general) is that they can tell you which features matter most for making predictions:

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importances:")
print(feature_importance)

# Plot feature importance
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance Analysis')
plt.gca().invert_yaxis()
plt.show()

Interpreting Feature Importance

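The output ranks the features by how much the tree relied on them. Because the labels in this simulated dataset were generated from tenure, complaints, contract status, and (weakly) support tickets, expect those features to dominate:

num_complaints: Customers who complain are far more likely to churn
tenure_months: Long-time customers are much more loyal
has_contract: Contract customers rarely churn
monthly_charges: In real churn data, higher bills often correlate with higher churn risk; in this simulation the column does not enter the label formula, so any importance it receives reflects noise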

This information is valuable for business decisions: it shows which customer behaviors to monitor and suggests which interventions are worth testing first.
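
A controlled experiment shows how importance scores separate a real driver from an irrelevant column. In this sketch the feature names `signal` and `noise` are hypothetical: only `signal` determines the label, and 10% of labels are flipped to mimic messy data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: only 'signal' influences the label; 'noise' is unrelated
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
labels = (demo["signal"] > 0).astype(int).to_numpy()
flip = rng.random(n) < 0.1               # flip about 10% of labels
labels = np.where(flip, 1 - labels, labels)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo, labels)
importances = pd.Series(tree.feature_importances_, index=demo.columns)

print(importances.sort_values(ascending=False))
print("Sum of importances:", importances.sum())  # impurity-based importances are normalized
```

The scores are normalized to sum to 1, so each one reads as a share of the tree's total impurity reduction. An unrelated column can still pick up a small share when the tree fits label noise.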

Using Vibe Coding for Classification

🔥 Vibe Coding Prompt for Classification

"I have a customer_data.csv with columns: tenure_months, monthly_charges, total_charges, num_support_tickets, has_contract, avg_order_value, num_complaints, and churn (the 0/1 target). Please help me:

Train both LogisticRegression and DecisionTreeClassifier churn prediction models

Compare their Accuracy, Precision, Recall, and F1 Score

Plot the Confusion Matrix for both models

Output decision tree feature importance ranking

Recommend which model is better for this task and explain why"

Summary

In this chapter, you learned:

What classification is: Predicting a category (Churn/No Churn, Fraud/Legitimate) instead of a continuous value
Why it matters: Classification powers fraud detection, churn prediction, medical diagnosis, and spam filtering, some of the most common real-world ML applications
How Logistic Regression works: Calculate a linear combination, then pass it through the Sigmoid function to get a probability between 0 and 1

Key takeaways:

Classification is fundamentally different from regression — it predicts categories, not numbers
Accuracy is not enough — with imbalanced data, a useless model can still hit 95% accuracy
Confusion Matrix is the foundation: TP, TN, FP, FN tell you exactly where the model errs
Precision vs Recall is a trade-off — optimize for Recall when missing a churner is costly, optimize for Precision when false alarms are expensive
F1 Score balances Precision and Recall into one metric
Decision Trees are transparent and interpretable — you can read the if-then rules directly
Feature Importance reveals which factors drive predictions, enabling data-driven business decisions

What Is Next: Random Forest and Overfitting

The next chapter takes on the biggest challenge in machine learning: overfitting. You will learn why a deep decision tree memorizes noise instead of learning patterns, and how Random Forest solves this by combining hundreds of trees into one robust, stable model. We will also cover cross-validation, hyperparameter tuning, and model comparison.