Your First ML Model: Linear Regression for House Price Prediction

After two chapters of data preparation, we finally train our first ML model!

The problem: given various features of a California housing district (median income, house age, average rooms, location), predict its median house price.

This is a classic regression problem — predicting a continuous numerical value.

Intuitive Understanding of Linear Regression

Linear Regression is the simplest and most fundamental regression algorithm. Its core concept: draw the straight line that best fits your data.

Imagine plotting dots on paper, where each dot is a house (X=size, Y=price). Linear regression finds the line that keeps the total squared vertical distance to the points as small as possible.

The math behind it is simple:

$$y = mx + b$$

$y$: predicted price (output)
$x$: house size (input feature)
$m$: slope (weight) — price increase per unit of size
$b$: intercept — base price when size is 0
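
To make the symbols concrete, here is a minimal sketch that fits $m$ and $b$ with NumPy. The sizes and prices are made-up toy values (not from the housing dataset), chosen so that the true line is easy to verify by eye.

```python
import numpy as np

# Hypothetical toy data: house size (square meters) and price (in $1,000s).
# The prices follow price = 3 * size + 50 exactly, so the fit is easy to check.
sizes = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
prices = 3.0 * sizes + 50.0

# np.polyfit with deg=1 fits a straight line by least squares and
# returns the coefficients with the highest power first: [m, b].
m, b = np.polyfit(sizes, prices, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Predict the price of a 110 m² house with y = m*x + b.
predicted = m * 110.0 + b
print(f"predicted price: {predicted:.1f}")
```

Because the toy prices lie exactly on a line, the fit recovers a slope of 3 and an intercept of 50. Real data is noisy, so the fitted line will only pass near the points.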

With multiple features (size, rooms, age), the formula becomes:

$$y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$$

The "learning" process is automatically finding the optimal $w$ (weights) and $b$ (bias) that minimize prediction error.
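
One way to picture this "learning" step is as a least-squares problem. The sketch below uses synthetic data with known weights (`true_w` and `true_b` are assumptions made up for the illustration) and recovers them with `np.linalg.lstsq`. It illustrates the idea and is not a description of how any particular library does it internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 houses with 3 features (e.g. size, rooms, age).
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the bias b is learned as one more weight.
X_aug = np.column_stack([X, np.ones(len(X))])

# Least squares: choose the parameters that minimize the sum of squared errors.
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = params[:3], params[3]

def mse(w, b):
    """Mean squared error of the line defined by weights w and bias b."""
    return np.mean((X @ w + b - y) ** 2)

print("learned w:", np.round(w, 2), "learned b:", round(float(b), 2))
print("MSE at learned parameters:", mse(w, b))
print("MSE with perturbed weights:", mse(w + 0.1, b))
```

The learned values land close to the ones used to generate the data, and nudging the weights away from the solution makes the error larger.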

Training with Scikit-Learn

Scikit-Learn is one of the most widely used ML libraries in Python. Its API is beautifully consistent:

# 1. Import the algorithm
from sklearn.linear_model import LinearRegression

# 2. Create the model
model = LinearRegression()

# 3. Train the model (feed it data)
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

That is it — only four lines!
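
The four steps above assume that `X_train`, `y_train`, and `X_test` already exist. As a quick way to try the API without downloading anything, here is a self-contained sketch on synthetic data. The two features and the coefficients 1.5, -2.0, and 3.0 are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical noise-free data: y = 1.5*x1 - 2.0*x2 + 3.0
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 3.0

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()      # create
model.fit(X_train, y_train)     # train
y_pred = model.predict(X_test)  # predict

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("number of predictions:", y_pred.shape[0])
```

Since the synthetic target has no noise, the fitted coefficients and intercept should match the values used to generate it.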

Complete Code

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# === 1. Load Data ===
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# === 2. Split Data ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === 3. Train Model ===
model = LinearRegression()
model.fit(X_train, y_train)

# === 4. Predict ===
y_pred = model.predict(X_test)

# === 5. Evaluate ===
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# === 6. Plot Predictions vs Actual ===
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Linear Regression: Predicted vs Actual')
plt.show()

How to Interpret Evaluation Metrics

After training, the key question: how accurate is the model?

1. Mean Absolute Error (MAE)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

MAE means "on average, how far off is each prediction?"

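If MAE = 0.5, each prediction is off by about $50,000 on average (the dataset's target is in units of 100,000 USD)
Pros: intuitive, same unit as the target
Cons: every error counts only in proportion to its size, so a few very large misses do not stand out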

2. Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

RMSE is similar to MAE but penalizes large errors more heavily (errors are squared before averaging).

RMSE is never smaller than MAE. If RMSE >> MAE: the model makes occasional very large errors
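
The difference between the two metrics is easiest to see on a tiny example. The sketch below implements both formulas directly in NumPy and compares two hypothetical sets of predictions with the same MAE: one where every prediction is slightly off, and one where a single prediction is badly off.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Case A: every prediction is off by 0.5.
pred_even = y_true + 0.5
# Case B: three perfect predictions and one that is off by 2.0.
pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 2.0])

print("even errors:   MAE =", mae(y_true, pred_even), "RMSE =", rmse(y_true, pred_even))
print("one big error: MAE =", mae(y_true, pred_outlier), "RMSE =", rmse(y_true, pred_outlier))
```

Both cases have an MAE of 0.5, but the RMSE is 0.5 in the first case and 1.0 in the second, because squaring makes the single large miss dominate.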

3. R² Score (The Most Important Metric!)

R² represents "how much of the data variance the model explains":

R² = 1.0: perfect prediction (impossible in practice)
R² = 0.8: model explains 80% of variance — very good
R² = 0.5: model explains half — barely usable
R² = 0.0: model is no better than guessing the mean
R² < 0.0: model is worse than guessing — something is wrong
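
These reference points follow from the standard definition of R², which compares the model's squared error with the squared error of always predicting the mean $\bar{y}$:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

A minimal sketch with made-up values shows the three regimes (perfect, mean guess, worse than the mean):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, computed from its definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()                            # predicts every value exactly
mean_guess = np.full_like(y_true, y_true.mean())   # always predicts 3.0
reversed_guess = y_true[::-1].copy()               # predicts high when truth is low

print("perfect:       ", r2(y_true, perfect))
print("mean guess:    ", r2(y_true, mean_guess))
print("reversed guess:", r2(y_true, reversed_guess))
```

The perfect predictions score 1.0, the mean guess scores 0.0, and the reversed guess scores -3.0, which is why a negative R² signals that something is wrong.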

For complex real-world problems like house price prediction, R² between 0.6 and 0.8 is considered quite good.

Inspecting What the Model Learned

After training, we can examine each feature's coefficient to understand what drives prices:

# Feature names (California Housing dataset)
feature_names = [
    'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
    'Population', 'AveOccup', 'Latitude', 'Longitude'
]

# View each feature coefficient
coefficients = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_
})
print(coefficients.sort_values('Coefficient', ascending=False))

# View intercept
print(f"\nIntercept (b): {model.intercept_:.4f}")

Example output (illustrative; exact values depend on your data preparation):

     Feature    Coefficient
0  MedInc       0.4375
2  AveRooms     0.0102
4  Population  -0.0007
1  HouseAge    -0.0058
3  AveBedrms   -0.0064
5  AveOccup    -0.0090
7  Longitude   -0.0412
6  Latitude    -0.0421

This tells us:

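MedInc (median income) has the largest positive coefficient: each additional unit of median income raises the predicted price by about 0.44, with the other features held fixed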
Latitude/Longitude matter — location drives price
AveBedrms is negative — more bedrooms in small spaces may indicate dense housing
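
One caveat when reading coefficients: each one is expressed per unit of its own feature, so features measured in large numbers (such as population) tend to get tiny coefficients even when they matter. A common remedy is to standardize the features before fitting. The sketch below shows this on synthetic data. The two features, their scales, and the weights 0.4 and 0.001 are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical features on very different scales.
income = rng.normal(4.0, 2.0, size=n)            # values around 4
population = rng.normal(1500.0, 1000.0, size=n)  # values around 1500
X = np.column_stack([income, population])

# Per standard deviation, population moves the target slightly more:
# income: 0.4 * 2 = 0.8, population: 0.001 * 1000 = 1.0
y = 0.4 * income + 0.001 * population + rng.normal(scale=0.1, size=n)

# Fit once on raw features and once on standardized features.
raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
scaled_coef = scaled.named_steps["linearregression"].coef_

print("raw coefficients:   ", raw.coef_)
print("scaled coefficients:", scaled_coef)
```

On raw features the income coefficient looks hundreds of times larger than the population one. After standardizing, the two are of similar size and population comes out slightly ahead, which matches how the data was generated.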

Use Vibe Coding to Train the Model

Do not want to write code manually? Let AI handle it:

🔥 Vibe Coding Prompt for Model Training

"I have clean_house_data.csv. Please help me:

1. Train a LinearRegression model to predict house prices.
2. Calculate and display MAE, RMSE, and R².
3. Plot predicted vs actual values.
4. Show each feature's coefficient and explain which matters most.
5. Save the model as house_price_model.pkl using Joblib.
6. Write a predict() function that loads the model and predicts new house prices."
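
For reference, the last two requests in the prompt (saving with Joblib and a `predict()` function) might look roughly like the sketch below. It trains on stand-in synthetic data rather than the housing data, writes the file into a temporary directory, and the `predict()` signature is an assumption for illustration. The code an AI assistant generates for you will likely differ.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in training data (hypothetical); in the chapter this would be X_train, y_train.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 3))
y_train = X_train @ np.array([1.0, 2.0, 3.0]) + 0.5

model = LinearRegression().fit(X_train, y_train)

# Save under the file name from the prompt, inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.pkl")
joblib.dump(model, model_path)

def predict(features, path=model_path):
    """Load the saved model and predict for a 2D array of feature rows."""
    loaded = joblib.load(path)
    return loaded.predict(np.asarray(features))

# One new "house" with all three features equal to 1.0.
new_house = [[1.0, 1.0, 1.0]]
print("prediction:", predict(new_house))
```

The loaded model gives the same predictions as the one in memory. Joblib files are pickle-based, so only load model files that come from a source you trust.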

Summary

In this chapter, you learned:

✅ Linear Regression: Fit a line to minimize prediction error
✅ Scikit-Learn API: The unified fit() → predict() workflow
✅ Evaluation Metrics: MAE, RMSE, R² — what each measures and why
✅ Coefficient Analysis: Understanding what drives predictions
✅ Model Persistence: Save and load models with Joblib

What Is Next: Classification

Linear regression predicts continuous values. The next chapter tackles classification — predicting whether a customer will churn, whether an email is spam, or whether a transaction is fraudulent. You will learn logistic regression, decision boundaries, confusion matrices, and precision vs. recall.