Classification: Customer Churn Prediction
Why Classification?
In the previous chapter, we predicted continuous values (house prices). But a huge class of real-world problems requires predicting categories rather than numbers:
- Is this transaction fraudulent? → Yes / No
- Will this customer churn next month? → Churn / No Churn
- Is this MRI tumor benign or malignant? → Benign / Malignant
These are all classification problems. Unlike regression, which answers "how much?", classification answers "which category?"
Why this matters for your career:
- Classification is among the most common ML tasks in business: fraud detection, churn prediction, medical diagnosis, and credit scoring are all classification problems
- Classification requires different evaluation metrics (precision, recall, F1) — mastering these separates serious ML engineers from beginners
- Real-world datasets are rarely balanced — learning how to handle imbalance is a critical skill
- Classification models are used in every industry: finance, healthcare, e-commerce, cybersecurity
Business Scenario: E-Commerce Customer Churn
Imagine you run a subscription-based e-commerce platform. Every month, some customers cancel. If you could predict which customers are about to churn, you could automatically send them a discount coupon or a care message before they leave, reducing your churn rate.
What Is Churn Prediction?
Churn prediction answers: "Given this customer's behavior history (tenure, spending, complaints, contract status), what is the probability they will cancel next month?"
| Feature | Description | Why It Matters |
|:--------|:------------|:---------------|
| Tenure | Months since signup | Longer tenure = more loyal |
| Monthly charges | Amount billed per month | Higher charges = higher churn risk |
| Support tickets | Number of customer service visits | More tickets = frustration |
| Contract status | Whether customer has a long-term contract | Contract = locked in |
| Complaints | Number of formal complaints | Complaints = immediate red flag |
Loading the Customer Dataset
We will use a simulated e-commerce dataset with 1,000 customers:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate simulated customer data
np.random.seed(42)
n_customers = 1000
df = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n_customers),        # Months of tenure
    'monthly_charges': np.random.uniform(30, 120, n_customers),    # Monthly charge
    'total_charges': np.random.uniform(100, 8000, n_customers),    # Total charges
    'num_support_tickets': np.random.randint(0, 10, n_customers),  # Support tickets
    'has_contract': np.random.choice([0, 1], n_customers),         # Has contract
    'avg_order_value': np.random.uniform(10, 200, n_customers),    # Avg order value
    'num_complaints': np.random.randint(0, 5, n_customers),        # Complaints
})
# Simulate churn labels (higher tenure + contract = less likely to churn)
churn_prob = (
    0.4
    - 0.005 * df['tenure_months']
    + 0.003 * df['num_support_tickets']
    + 0.05 * df['num_complaints']
    - 0.15 * df['has_contract']
    + np.random.normal(0, 0.1, n_customers)
)
df['churn'] = (churn_prob > 0.5).astype(int)
print(f"Dataset size: {df.shape}")
print(f"Churn rate: {df['churn'].mean()*100:.1f}%")
Logistic Regression
Do not be fooled by the name! Despite having "Regression" in it, Logistic Regression is a classification algorithm.
How It Works
The idea: calculate a linear combination $z = w_1x_1 + w_2x_2 + ... + b$ (as in linear regression), then pass $z$ through a Sigmoid function to squash the output into a probability between 0 and 1.
$$P(y=1) = \frac{1}{1 + e^{-(w_1x_1 + w_2x_2 + ... + b)}}$$
| Input | Calculation | Output |
|:------|:------------|:-------|
| z = large positive (e.g. 5) | Sigmoid(5) = 0.993 | 99% probability of churn |
| z = 0 | Sigmoid(0) = 0.500 | Exactly 50/50, uncertain |
| z = large negative (e.g. -5) | Sigmoid(-5) = 0.007 | 1% probability of churn |
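The values in the table can be reproduced in a few lines of NumPy. This is a minimal sketch; the helper name `sigmoid` is chosen here for illustration and is not part of the chapter's code.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# The three example scores from the table: large positive, zero, large negative
scores = np.array([5.0, 0.0, -5.0])
probs = sigmoid(scores)

for z, p in zip(scores, probs):
    print(f"z = {z:+.1f} -> sigmoid(z) = {p:.3f}")
```

Note that the function is symmetric around zero: `sigmoid(-z)` equals `1 - sigmoid(z)`, which is why the first and last rows of the table sum to 1.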
Decision Rule
- If probability $\geq 0.5$, predict Churn (1)
- If probability $< 0.5$, predict No Churn (0)
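The decision rule is a single comparison against a threshold. The sketch below uses made-up probabilities and a hypothetical helper `apply_threshold`; it also shows that 0.5 is only a default, and that a lower threshold flags more customers as likely churners.

```python
import numpy as np

# Hypothetical predicted churn probabilities for five customers
churn_proba = np.array([0.91, 0.50, 0.49, 0.30, 0.05])

def apply_threshold(proba, threshold=0.5):
    """Return 1 (Churn) where proba >= threshold, otherwise 0 (No Churn)."""
    return (proba >= threshold).astype(int)

default_labels = apply_threshold(churn_proba)        # standard 0.5 rule
cautious_labels = apply_threshold(churn_proba, 0.3)  # lower bar, more flags

print("Threshold 0.5:", default_labels)
print("Threshold 0.3:", cautious_labels)
```

Lowering the threshold trades more false alarms for fewer missed churners, which is one practical way to act on the precision and recall trade-off.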
Training the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Split data
X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
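# Scale features (not strictly required, but recommended for Logistic Regression:
# it helps the solver converge and keeps regularization comparable across features)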
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)
# Predict
y_pred = log_model.predict(X_test_scaled)
y_pred_proba = log_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
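An alternative way to organize the same steps is to chain the scaler and the classifier in a scikit-learn pipeline and to stratify the split so both halves keep a similar share of positives. The sketch below is not the chapter's code: it uses stand-in data from `make_classification` (an assumption, roughly 10% positives), and names such as `X_demo` and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): 1,000 rows, 7 features, about 10% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# stratify=y_demo keeps the positive rate similar in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
print(f"Test accuracy:       {pipe.score(X_te, y_te):.4f}")
```

Bundling the scaler into the pipeline makes it hard to accidentally fit the scaler on test data, which is the same precaution the manual `fit_transform` / `transform` split takes.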
Classification Evaluation Metrics
Classification evaluation is more complex than regression. You cannot just look at accuracy — you need to understand where the model makes mistakes and what kind of mistakes it makes.
The Confusion Matrix
The Confusion Matrix is the foundation of classification evaluation. It shows four outcomes:
| | Actual Positive | Actual Negative |
|:---|:--------------:|:--------------:|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Manual interpretation
TN, FP, FN, TP = cm.ravel()
print(f"True Negatives (TN): {TN} — Correctly predicted no churn")
print(f"False Positives (FP): {FP} — Falsely flagged as churn (wasted coupon)")
print(f"False Negatives (FN): {FN} — Missed churn (most severe!)")
print(f"True Positives (TP): {TP} — Correctly predicted churn")
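One detail worth checking on a toy example: scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so for labels 0 and 1 the array reads `[[TN, FP], [FN, TP]]`. That is why `ravel()` unpacks in the order TN, FP, FN, TP. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = Churn, 0 = No Churn
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Counting by hand gives 5 correct "no churn" predictions, 1 false alarm, 1 missed churner, and 3 caught churners, which matches the unpacked values.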
Precision vs. Recall
Precision: Of all customers predicted to churn, how many actually churned?
$$\text{Precision} = \frac{TP}{TP + FP}$$
- Low precision = too many false alarms = wasted coupon costs
Recall: Of all customers who actually churned, how many did we catch?
$$\text{Recall} = \frac{TP}{TP + FN}$$
- Low recall = too many missed churners = lost customers
F1 Score: The harmonic mean of Precision and Recall, one number that balances both.
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))
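The three formulas can be worked by hand on a small example and compared with scikit-learn's own functions. The labels below are hypothetical and chosen so that TP = 3, FP = 2, and FN = 1.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual churners (3 caught, 1 missed), 2 false alarms
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn = 3, 2, 1
precision = tp / (tp + fp)                           # 3 / 5
recall = tp / (tp + fn)                              # 3 / 4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"By hand:  precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print(
    "sklearn:  "
    f"precision={precision_score(y_true_toy, y_pred_toy):.3f}, "
    f"recall={recall_score(y_true_toy, y_pred_toy):.3f}, "
    f"f1={f1_score(y_true_toy, y_pred_toy):.3f}"
)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (0.667 here, between 0.600 and 0.750) than a simple average would.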
Why Accuracy Can Be Misleading
If only 5% of customers churn, a model that predicts "No Churn" for everyone achieves 95% accuracy — but is completely useless! It never catches anyone who is about to leave.
When data is imbalanced (few churners), focus on Recall and Precision, not Accuracy. This is one of the most important lessons in machine learning.
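The 95% figure is easy to verify. This sketch builds a hypothetical population of 1,000 customers with exactly 5% churners and scores a "model" that predicts No Churn for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical population: 50 churners (1) and 950 loyal customers (0)
y_true_imb = np.array([1] * 50 + [0] * 950)

# A "model" that always predicts No Churn
y_pred_all_negative = np.zeros_like(y_true_imb)

acc = accuracy_score(y_true_imb, y_pred_all_negative)
rec = recall_score(y_true_imb, y_pred_all_negative)

print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
```

Accuracy comes out at 0.95 while recall is 0.00: every one of the 50 churners is missed, which is exactly the failure that accuracy alone hides.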
Decision Tree
A Decision Tree is another highly intuitive classification algorithm. It works like an "if...then..." quiz game where each question splits the data into smaller groups.
Has Contract?
├── No: Frequent Complaints?
│ ├── Yes: Predict Churn ❌
│ └── No: Tenure > 12 months?
│ ├── Yes: Predict No Churn ✅
│ └── No: Predict Churn ❌
└── Yes: Predict No Churn ✅
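The diagram above translates directly into nested if-statements. The function below is a hand-written illustration of that one diagram, with a hypothetical name and signature; a trained tree learns its own questions and cut-off values from the data.

```python
def predict_churn(has_contract: bool, frequent_complaints: bool,
                  tenure_months: int) -> str:
    """Follow the example tree from the root question down to a leaf."""
    if has_contract:
        return "No Churn"       # right branch: contract customers stay
    if frequent_complaints:
        return "Churn"          # no contract and frequent complaints
    if tenure_months > 12:
        return "No Churn"       # no contract, few complaints, long tenure
    return "Churn"              # no contract, few complaints, short tenure

print(predict_churn(has_contract=True, frequent_complaints=True, tenure_months=2))
print(predict_churn(has_contract=False, frequent_complaints=False, tenure_months=24))
print(predict_churn(has_contract=False, frequent_complaints=False, tenure_months=6))
```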
Why Decision Trees are popular:
- Completely transparent — you can read and explain every decision
- No feature scaling needed — works directly with raw data
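- Handles both numeric and categorical data in principle (scikit-learn's implementation expects numeric input, so categorical columns must be encoded first)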
- Built-in feature importance tells you what matters most
from sklearn.tree import DecisionTreeClassifier
# Decision Tree does not require scaling
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
# Predict and evaluate
y_pred_dt = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['No Churn', 'Churn']))
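max_depth=5 caps the tree at five levels of splits. A shallower tree has less room to memorize noise in the training data, which reduces (but does not eliminate) the risk of overfitting.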
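To see the learned if-then rules of a fitted tree as text, scikit-learn provides `export_text`. The sketch below is self-contained and uses stand-in data from `make_classification` with made-up feature names (both are assumptions); a depth of 2 keeps the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (an assumption): 500 rows, 4 numeric features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # hypothetical

# A shallow tree so the rules fit on a few lines
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

rules = export_text(small_tree, feature_names=feature_names)
print(rules)
print(f"Actual depth: {small_tree.get_depth()}")
```

Each indented line in the output is one question on the path from the root to a leaf, and the `class:` lines are the final predictions.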
Feature Importance Analysis
One of the most powerful features of Decision Trees (and tree-based models in general) is that they can tell you which features matter most for making predictions:
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Feature Importances:")
print(feature_importance)
# Plot feature importance
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance Analysis')
plt.gca().invert_yaxis()
plt.show()
Interpreting Feature Importance
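The output ranks the features by how much the model relied on them. Importance measures how much a feature contributed to the tree's splits, not the direction of its effect. Because the churn labels here were simulated mainly from tenure, complaints, and contract status (support tickets have only a small weight), expect these to dominate:
- num_complaints: Customers who complain are more likely to churn
- tenure_months: Long-time customers are more loyal
- has_contract: Contract customers churn less often
- monthly_charges, total_charges, avg_order_value: These were not used to generate the labels, so any importance they receive reflects noise in this dataset; on real data, higher bills may well correlate with higher churn risk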
This information is pure gold for business decisions: you know which customer behaviors to monitor first. Keep in mind that importance reflects association, not causation, so test an intervention before assuming it will have the biggest impact.
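A quick sanity check on how importances behave is to build data where the answer is known. In this sketch (hypothetical data, not the chapter's dataset) the label depends only on `num_complaints`, so the tree should give that feature all of the importance and give the unrelated `monthly_charges` column none.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: one feature that matters, one that does not
demo = pd.DataFrame({
    "num_complaints": rng.integers(0, 5, n),
    "monthly_charges": rng.uniform(30, 120, n),
})

# The label is fully determined by complaints; charges play no role
label = (demo["num_complaints"] >= 3).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo, label)
importances = pd.Series(tree.feature_importances_, index=demo.columns)

print(importances.sort_values(ascending=False))
```

The importances are non-negative and sum to 1, so they rank features by how much they were used but say nothing about whether a feature pushes churn up or down.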
Using Vibe Coding for Classification
🔥 Vibe Coding Prompt for Classification
"I have a customer_data.csv with columns: tenure_months, monthly_charges, total_charges, num_support_tickets, has_contract, avg_order_value, num_complaints, and the target column churn. Please help me:
- Train both LogisticRegression and DecisionTreeClassifier churn prediction models
- Compare their Accuracy, Precision, Recall, and F1 Score
- Plot the Confusion Matrix for both models
- Output decision tree feature importance ranking
- Recommend which model is better for this task and explain why"
Summary
In this chapter, you learned:
- What classification is: Predicting a category (Churn/No Churn, Fraud/Legitimate) instead of a continuous value
- Why it matters: Classification powers fraud detection, churn prediction, medical diagnosis, and spam filtering, which together cover many real-world ML applications
- How Logistic Regression works: Calculate a linear combination, then pass it through the Sigmoid function to get a probability between 0 and 1
Key takeaways:
- Classification is fundamentally different from regression — it predicts categories, not numbers
- Accuracy is not enough — with imbalanced data, a useless model can still hit 95% accuracy
- Confusion Matrix is the foundation: TP, TN, FP, FN tell you exactly where the model errs
- Precision vs Recall is a trade-off — optimize for Recall when missing a churner is costly, optimize for Precision when false alarms are expensive
- F1 Score balances Precision and Recall into one metric
- Decision Trees are transparent and interpretable — you can read the if-then rules directly
- Feature Importance reveals which factors drive predictions, enabling data-driven business decisions
What Is Next: Random Forest and Overfitting
The next chapter takes on the biggest challenge in machine learning: overfitting. You will learn why a deep decision tree memorizes noise instead of learning patterns, and how Random Forest solves this by combining hundreds of trees into one robust, stable model. We will also cover cross-validation, hyperparameter tuning, and model comparison.