Your First ML Model: Linear Regression for House Price Prediction
After two chapters of data preparation, we finally train our first ML model!
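The problem: given features of a California housing district (median income, house age, average rooms, location), predict its median house price. In the dataset used below, prices are expressed in units of $100,000.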
This is a classic regression problem — predicting a continuous numerical value.
Intuitive Understanding of Linear Regression
Linear Regression is the simplest and most fundamental regression algorithm. Its core concept: draw the straight line that best fits your data.
Imagine plotting dots on paper, where each dot is a house (X=size, Y=price). Linear regression finds the line that keeps the total squared vertical distance to all the points as small as possible.
The math behind it is simple:
$$y = mx + b$$
- $y$: predicted price (output)
- $x$: house size (input feature)
- $m$: slope (weight) — price increase per unit of size
- $b$: intercept — base price when size is 0
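To make the single-feature formula concrete, here is a minimal sketch that fits $m$ and $b$ to a handful of points with NumPy's `np.polyfit`. The sizes and prices are invented for illustration and are not taken from the housing dataset; they were chosen to lie exactly on the line price = 2 * size + 1, so the fit should recover a slope of 2 and an intercept of 1.

```python
import numpy as np

# Hypothetical toy data: house size and price in arbitrary units.
# The points lie exactly on the line price = 2 * size + 1.
size = np.array([1.0, 2.0, 3.0, 4.0])
price = np.array([3.0, 5.0, 7.0, 9.0])

# A degree-1 polynomial fit is a least-squares straight line.
# polyfit returns coefficients from highest degree to lowest: [m, b].
m, b = np.polyfit(size, price, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Predict the price of a house of size 5 using y = m*x + b.
predicted = m * 5.0 + b
print(f"predicted price for size 5: {predicted:.2f}")
```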
With multiple features (size, rooms, age), the formula becomes:
$$y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$$
The "learning" process automatically finds the weights $w$ and bias $b$ that minimize prediction error. For ordinary linear regression, that error is the sum of squared differences between predicted and actual values.
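As a rough sketch of what "finding the optimal weights" means, the example below generates synthetic data from known weights and recovers them by solving the least-squares problem directly with `np.linalg.lstsq`. The data, the `true_w` and `true_b` values, and the variable names are assumptions made for illustration; scikit-learn's own solver may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, generated from known weights.
true_w = np.array([1.5, -2.0, 0.5])
true_b = 4.0
X = rng.normal(size=(200, 3))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=200)  # small noise

# Append a column of ones so the bias b is learned as one extra weight.
X_aug = np.column_stack([X, np.ones(len(X))])

# Solve min ||X_aug @ params - y||^2, i.e. ordinary least squares.
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = params[:-1], params[-1]

print("learned w:", np.round(w, 2))
print("learned b:", round(float(b), 2))
```

Because the noise is small, the learned values should land very close to the ones used to generate the data.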
Training with Scikit-Learn
Scikit-Learn is one of the most widely used ML libraries in Python. Its API is consistent across algorithms: create a model, call fit(), then call predict():
# 1. Import the algorithm
from sklearn.linear_model import LinearRegression
# 2. Create the model
model = LinearRegression()
# 3. Train the model (feed it data)
model.fit(X_train, y_train)
# 4. Make predictions
y_pred = model.predict(X_test)
That is it — only four lines!
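One way to see the consistency is to swap the algorithm while leaving everything else unchanged. The sketch below uses a made-up one-feature dataset and two other scikit-learn regressors (`Ridge` and `DecisionTreeRegressor`) purely as examples; they are not part of this chapter's workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data following y = 3*x + 2 exactly.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0
X_train, X_test = X[:15], X[15:]
y_train = y[:15]

models = (LinearRegression(), Ridge(alpha=1.0), DecisionTreeRegressor(random_state=0))

preds = {}
for model in models:
    model.fit(X_train, y_train)        # same call for every algorithm
    y_pred = model.predict(X_test)     # same call for every algorithm
    preds[type(model).__name__] = y_pred
    print(type(model).__name__, np.round(y_pred, 1))
```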
Complete Code
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
# === 1. Load Data ===
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target  # median house value, in units of $100,000
# === 2. Split Data ===
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# === 3. Train Model ===
model = LinearRegression()
model.fit(X_train, y_train)
# === 4. Predict ===
y_pred = model.predict(X_test)
# === 5. Evaluate ===
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
# === 6. Plot Predictions vs Actual ===
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Price (x 100,000 USD)')
plt.ylabel('Predicted Price (x 100,000 USD)')
plt.title('Linear Regression: Predicted vs Actual')
plt.show()
How to Interpret Evaluation Metrics
After training, the key question: how accurate is the model?
1. Mean Absolute Error (MAE)
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
MAE means "on average, how far off is each prediction?"
- The target is in units of $100,000, so MAE = 0.5 means each prediction is off by $50,000 on average
- Pros: intuitive, same unit as the target
- Cons: weights every error in proportion to its size, so a few very large misses do not stand out
2. Root Mean Squared Error (RMSE)
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
RMSE is similar to MAE but penalizes large errors more heavily (errors are squared before averaging).
- If RMSE >> MAE: the model makes occasional very large errors
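The difference between the two metrics is easiest to see on a tiny example. In the sketch below (the numbers are invented), both sets of predictions have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([2.0, 2.0, 2.0, 2.0])
y_even = np.array([3.0, 3.0, 3.0, 3.0])     # every prediction off by 1
y_outlier = np.array([2.0, 2.0, 2.0, 6.0])  # one prediction off by 4

results = {}
for name, y_hat in [("even", y_even), ("outlier", y_outlier)]:
    mae = mean_absolute_error(y_true, y_hat)
    rmse = np.sqrt(mean_squared_error(y_true, y_hat))
    results[name] = (mae, rmse)
    print(f"{name:8s} MAE = {mae:.2f}  RMSE = {rmse:.2f}")
```

MAE is 1.0 in both cases, while RMSE rises from 1.0 to 2.0 when the error is concentrated in one prediction.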
3. R² Score (The Most Important Metric!)
R² represents "how much of the data variance the model explains":
- R² = 1.0: perfect prediction (essentially never reached on real, noisy data)
- R² = 0.8: model explains 80% of variance — very good
- R² = 0.5: model explains half — barely usable
- R² = 0.0: model is no better than guessing the mean
- R² < 0.0: model is worse than guessing — something is wrong
For complex real-world problems like house price prediction, R² between 0.6 and 0.8 is considered quite good.
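For reference, R² compares the model's squared error with the squared error of always predicting the mean:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

The sketch below computes this by hand on made-up numbers and checks it against `r2_score`. The helper name `r2_manual` is hypothetical and exists only for this example.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A reasonable model, the "always predict the mean" baseline,
# and deliberately bad predictions (the targets in reverse order).
r2_model = r2_manual(y_true, y_pred)
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))
r2_bad = r2_manual(y_true, y_true[::-1])

print(f"model:    {r2_model:.2f}")
print(f"mean:     {r2_mean:.2f}")
print(f"reversed: {r2_bad:.2f}")
print("matches sklearn:", np.isclose(r2_model, r2_score(y_true, y_pred)))
```

The mean baseline scores exactly 0, and the reversed predictions score below 0, which matches the interpretation in the list above.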
Inspecting What the Model Learned
After training, we can examine each feature's coefficient to understand what drives prices:
# Feature names (California Housing dataset)
feature_names = [
'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude'
]
# View each feature coefficient
coefficients = pd.DataFrame({
'Feature': feature_names,
'Coefficient': model.coef_
})
print(coefficients.sort_values('Coefficient', ascending=False))
# View intercept
print(f"\nIntercept (b): {model.intercept_:.4f}")
Example output (illustrative; exact values depend on your data split and preprocessing):
      Feature  Coefficient
0      MedInc       0.4375
2    AveRooms       0.0102
4  Population      -0.0007
1    HouseAge      -0.0058
3   AveBedrms      -0.0064
5    AveOccup      -0.0090
7   Longitude      -0.0412
6    Latitude      -0.0421
This tells us:
- MedInc (median income) has the largest positive coefficient
- Latitude and Longitude have the largest negative coefficients, so location matters
- AveBedrms is negative: more bedrooms in small spaces may indicate dense housing
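One caveat when reading coefficients: each one is expressed per unit of its own feature, so features measured on very different scales (income in small numbers, population in the thousands) are not directly comparable. A common remedy is to standardize the features before fitting. The sketch below illustrates this on synthetic data, not the housing dataset; the feature names and generating weights are assumptions chosen so that both features matter equally per standard deviation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Hypothetical features on very different scales.
income = rng.normal(loc=4.0, scale=2.0, size=n)            # std about 2
population = rng.normal(loc=1500.0, scale=1000.0, size=n)  # std about 1000
X = np.column_stack([income, population])

# Each feature shifts y by roughly 1.0 per standard deviation.
y = 0.5 * income + 0.001 * population + rng.normal(scale=0.1, size=n)

# Raw coefficients differ by orders of magnitude.
raw = LinearRegression().fit(X, y)

# After standardizing, the coefficients are on a comparable footing.
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
scaled_coef = scaled.named_steps["linearregression"].coef_

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled_coef)
```

The raw coefficients suggest income matters far more than population, while the standardized ones show the two contributing about equally.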
Use Vibe Coding to Train the Model
Don't want to write the code by hand? Let an AI assistant handle it:
🔥 Vibe Coding Prompt for Model Training
"I have clean_house_data.csv. Please help me:
- Train a LinearRegression model to predict house prices.
- Calculate and display MAE, RMSE, and R².
- Plot predicted vs actual values.
- Show each feature's coefficient and explain which matters most.
- Save the model as house_price_model.pkl using Joblib.
- Write a predict() function that loads the model and predicts new house prices."
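For comparison, the last two items of that prompt could look roughly like this when written by hand. This is a sketch under stated assumptions: the model is trained on random stand-in data rather than clean_house_data.csv, the file is written to a temporary directory, and the `predict()` signature is one possible design, not a fixed convention.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the trained housing model (8 features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))
y_train = X_train @ np.arange(1.0, 9.0) + 2.0
model = LinearRegression().fit(X_train, y_train)

# Save the fitted model to disk (a temp directory is used for this demo).
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.pkl")
joblib.dump(model, model_path)

def predict(features, path=model_path):
    """Load the saved model and predict for one or more rows of features."""
    loaded = joblib.load(path)
    features = np.atleast_2d(np.asarray(features, dtype=float))
    return loaded.predict(features)

# A single row works as well as a batch of rows.
print(predict(X_train[0]))
```

Joblib files are pickles under the hood, so only load model files that come from a source you trust.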
Summary
In this chapter, you learned:
- ✅ Linear Regression: Fit a line to minimize prediction error
- ✅ Scikit-Learn API: The unified fit() → predict() workflow
- ✅ Evaluation Metrics: MAE, RMSE, R² — what each measures and why
- ✅ Coefficient Analysis: Understanding what drives predictions
- ✅ Model Persistence: Save and load models with Joblib
What Is Next: Classification
Linear regression predicts continuous values. The next chapter tackles classification — predicting whether a customer will churn, whether an email is spam, or whether a transaction is fraudulent. You will learn logistic regression, decision boundaries, confusion matrices, and precision vs. recall.