Your First ML Model: Linear Regression for House Price Prediction

After two chapters of data preparation, we finally train our first ML model!

The problem: given features of a California district (median income, house age, average rooms, location), predict its median house price.

This is a classic regression problem — predicting a continuous numerical value.

Intuitive Understanding of Linear Regression

Linear Regression is the simplest and most fundamental regression algorithm. Its core concept: draw the straight line that best fits your data.

Imagine plotting dots on paper, where each dot is a house (X = size, Y = price). Linear regression finds the line that minimizes the total squared vertical distance between the line and the points.

The math behind it is simple:

$$y = mx + b$$

  • $y$: predicted price (output)
  • $x$: house size (input feature)
  • $m$: slope (weight) — price increase per unit of size
  • $b$: intercept — base price when size is 0
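As a rough illustration of the formula above, the sketch below fits $m$ and $b$ to five made-up (size, price) pairs using the standard closed-form least-squares solution. The data, the units, and the helper name `fit_line` are assumptions for this example only; they are not part of the chapter's dataset.

```python
# Toy data (hypothetical): house size in square meters, price in $1,000s.
sizes = [50.0, 70.0, 90.0, 110.0, 130.0]
prices = [150.0, 210.0, 260.0, 330.0, 380.0]


def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line always passes through (x_mean, y_mean).
    b = y_mean - m * x_mean
    return m, b


m, b = fit_line(sizes, prices)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"predicted price for 100 m^2: {m * 100 + b:.1f}")
```

Here $m$ reads as "extra price per additional square meter" and $b$ as the price the line assigns to a size of 0, matching the bullet points above.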

With multiple features (size, rooms, age), the formula becomes:

$$y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$$

The "learning" process automatically finds the $w$ (weights) and $b$ (bias) that minimize prediction error.
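To make "finding the optimal weights" concrete, here is a minimal sketch that recovers $w$ and $b$ from synthetic data with NumPy's least-squares solver. The "true" weights and bias are invented for the demonstration so the result can be checked. scikit-learn's `LinearRegression` also fits by ordinary least squares, but this sketch is only an illustration of the idea, not a copy of its internals.

```python
import numpy as np

# Synthetic data generated from known weights, so the answer can be checked.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 houses, 3 made-up features
true_w = np.array([2.0, -1.0, 0.5])    # hypothetical "true" weights
true_b = 4.0                           # hypothetical "true" bias
y = X @ true_w + true_b                # no noise, so recovery should be near exact

# Append a column of ones so the bias b is learned as one more weight.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Least squares: the parameters that minimize the sum of squared errors.
params, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
w, b = params[:-1], params[-1]

print("learned weights:", np.round(w, 3))
print("learned bias:", round(float(b), 3))
```

With real, noisy data the learned values would only approximate the underlying relationship rather than match it exactly.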

Training with Scikit-Learn

scikit-learn is one of the most widely used ML libraries in Python. Its API is consistent across algorithms:

# 1. Import the algorithm
from sklearn.linear_model import LinearRegression

# 2. Create the model
model = LinearRegression()

# 3. Train the model (feed it data)
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

That is it — only four lines!
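The four steps above assume `X_train`, `y_train`, and `X_test` already exist. The sketch below fills that gap with small synthetic data (an assumption made purely so the snippet runs on its own) and then applies the same create, fit, predict pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y = 3*x1 + 2*x2 + 1, with no noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()        # create the model
model.fit(X_train, y_train)       # train it
y_pred = model.predict(X_test)    # predict on unseen rows

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Because the synthetic target is an exact linear function of the features, the learned coefficients should land very close to 3, 2, and 1.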

Complete Code

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# === 1. Load Data ===
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# === 2. Split Data ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === 3. Train Model ===
model = LinearRegression()
model.fit(X_train, y_train)

# === 4. Predict ===
y_pred = model.predict(X_test)

# === 5. Evaluate ===
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# === 6. Plot Predictions vs Actual ===
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Price (units of 100,000 USD)')
plt.ylabel('Predicted Price (units of 100,000 USD)')
plt.title('Linear Regression: Predicted vs Actual')
plt.show()

How to Interpret Evaluation Metrics

After training, the key question: how accurate is the model?

1. Mean Absolute Error (MAE)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

MAE means "on average, how far off is each prediction?"

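  • If MAE = 0.5, predictions are off by about $50,000 on average, because the California Housing target is expressed in units of $100,000
  • Pros: intuitive, same unit as the target
  • Cons: every error counts in proportion to its size, so a few very large misses do not stand out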
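The MAE formula can be checked by hand with a few lines of plain Python. The values below are made up and use the dataset's unit of $100,000; the function name `mean_absolute_error_manual` is hypothetical and exists only to distinguish it from scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values in units of $100,000.
y_true = [2.0, 1.5, 3.0, 4.0]
y_hat = [2.5, 1.0, 3.5, 3.5]

mae = mean_absolute_error_manual(y_true, y_hat)
print(f"MAE = {mae:.2f} (about ${mae * 100_000:,.0f})")
```

Every prediction in this toy example misses by exactly 0.5, so the average miss is also 0.5, which corresponds to $50,000.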

2. Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

RMSE is similar to MAE but penalizes large errors more heavily (errors are squared before averaging).

  • If RMSE >> MAE: the model makes occasional very large errors
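A small comparison shows why a gap between RMSE and MAE hints at occasional large errors. Both prediction sets below are invented so that they have the same MAE; only the way the error is distributed differs.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square, average, then take the square root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [2.0, 2.0, 2.0, 2.0]
even_errors = [2.5, 1.5, 2.5, 1.5]    # every prediction is off by 0.5
one_big_miss = [2.0, 2.0, 2.0, 4.0]   # three perfect, one off by 2.0

for name, preds in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE={mae(y_true, preds):.2f}, RMSE={rmse(y_true, preds):.2f}")
```

When the error is spread evenly, the two metrics agree; when the same total error is concentrated in one prediction, RMSE rises while MAE stays put.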

3. R² Score (The Most Important Metric!)

R² measures how much of the variance in the target the model explains, compared with always predicting the mean:

  • R² = 1.0: perfect prediction (essentially never reached on real data)
  • R² = 0.8: model explains 80% of the variance, which is very good
  • R² = 0.5: model explains half of the variance, which is barely usable
  • R² = 0.0: model is no better than always predicting the mean
  • R² < 0.0: model is worse than predicting the mean, so something is wrong

For complex real-world problems like house price prediction, R² between 0.6 and 0.8 is considered quite good.
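R² is conventionally computed as $1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the model's total squared error and $SS_{tot}$ is the total squared error of always predicting the mean. The sketch below applies that definition to made-up numbers to reproduce the cases in the list above; the helper name `r2_manual` is hypothetical.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of the "always predict the mean" baseline.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]        # always predict the mean
decent = [1.5, 2.0, 2.5, 4.0]           # close, but not perfect
reversed_preds = [4.0, 3.0, 2.0, 1.0]   # systematically wrong

print("perfect:  ", r2_manual(y_true, y_true))
print("decent:   ", r2_manual(y_true, decent))
print("mean only:", r2_manual(y_true, mean_only))
print("reversed: ", r2_manual(y_true, reversed_preds))
```

The reversed predictions give a negative score because their squared error is larger than that of the mean baseline.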

Inspecting What the Model Learned

After training, we can examine each feature's coefficient to understand what drives prices:

# Feature names (California Housing dataset)
feature_names = [
    'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
    'Population', 'AveOccup', 'Latitude', 'Longitude'
]

# View each feature coefficient
coefficients = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_
})
print(coefficients.sort_values('Coefficient', ascending=False))

# View intercept
print(f"\nIntercept (b): {model.intercept_:.4f}")

Example output (illustrative values, sorted from largest to smallest coefficient):

     Feature    Coefficient
0  MedInc       0.4375
2  AveRooms     0.0102
4  Population  -0.0007
1  HouseAge    -0.0058
3  AveBedrms   -0.0064
5  AveOccup    -0.0090
7  Longitude   -0.0412
6  Latitude    -0.0421

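This tells us:

  • MedInc (median income) has the largest positive coefficient
  • Latitude and Longitude both have negative coefficients, so location affects price
  • AveBedrms is negative, which may reflect denser housing (more bedrooms packed into smaller homes)

Keep in mind that each coefficient is the change in predicted price per one unit of that feature, with the other features held fixed. Because the features use very different units (income, years, people, degrees), raw coefficient sizes are not directly comparable; standardize the features before ranking them by importance.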
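One common way to compare features on an equal footing is to standardize them before fitting. The sketch below uses synthetic data (not the housing dataset) with one small-scale and one large-scale feature, both invented for this example, to show how the ranking of coefficients can flip once the features share a scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small_scale = rng.normal(0, 1, size=500)       # feature measured in single units
large_scale = rng.normal(0, 5000, size=500)    # feature measured in thousands
X = np.column_stack([small_scale, large_scale])
# The large-scale feature moves y more overall, despite its tiny per-unit weight.
y = 1.0 * small_scale + 0.001 * large_scale

# Fit once on raw features, once on standardized features.
raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

raw_coef = raw.coef_
scaled_coef = scaled.named_steps["linearregression"].coef_

print("raw coefficients:         ", np.round(raw_coef, 4))
print("standardized coefficients:", np.round(scaled_coef, 4))
```

On the raw features the small-scale feature looks dominant; after standardization each coefficient describes the effect of a one-standard-deviation change, and the large-scale feature turns out to matter more.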

Use Vibe Coding to Train the Model

Don't want to write the code by hand? Let an AI assistant handle it:

🔥 Vibe Coding Prompt for Model Training

"I have clean_house_data.csv. Please help me:

  1. Train a LinearRegression model to predict house prices.
  2. Calculate and display MAE, RMSE, and R².
  3. Plot predicted vs actual values.
  4. Show each feature's coefficient and explain which matters most.
  5. Save the model as house_price_model.pkl using Joblib.
  6. Write a predict() function that loads the model and predicts new house prices."
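For reference, steps 5 and 6 of the prompt might produce something along these lines. This is a minimal sketch, assuming a tiny synthetic model in place of the real one and a temporary directory for the file; `joblib.dump` and `joblib.load` are the actual Joblib calls, while the `predict()` signature here is just one possible design.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic model as a stand-in for the trained house price model.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = X @ np.array([2.0, 1.0]) + 0.5
model = LinearRegression().fit(X, y)

# Save to disk. The file name follows the prompt; the temp directory is only for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.pkl")
joblib.dump(model, model_path)


def predict(features, path=model_path):
    """Load the saved model and predict for a 2D array-like of feature rows."""
    loaded = joblib.load(path)
    return loaded.predict(np.asarray(features, dtype=float))


print(predict([[5.0, 5.0]]))
```

Joblib files are pickles under the hood, so only load model files that come from a source you trust.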

Summary

In this chapter, you learned:

  1. Linear Regression: Fit a line to minimize prediction error
  2. Scikit-Learn API: The unified fit() → predict() workflow
  3. Evaluation Metrics: MAE, RMSE, R² — what each measures and why
  4. Coefficient Analysis: Understanding what drives predictions
  5. Model Persistence: Saving and loading models with Joblib (requested in the Vibe Coding prompt)

What Is Next: Classification

Linear regression predicts continuous values. The next chapter tackles classification — predicting whether a customer will churn, whether an email is spam, or whether a transaction is fraudulent. You will learn logistic regression, decision boundaries, confusion matrices, and precision vs. recall.
