Linear Regression from Scratch

Vibe Prompt

"Train linear regression from scratch using Adam Optimizer, and compare your custom Adam implementation with sklearn's results."

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Generate synthetic data
X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
y = y.reshape(-1, 1)
# Fix the split seed as well, so the reported scores are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Custom Adam training
def train_adam(X, y, lr=0.01, epochs=200):
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    m_w, v_w = np.zeros((n, 1)), np.zeros((n, 1))
    m_b, v_b = 0.0, 0.0
    beta1, beta2 = 0.9, 0.999
    eps = 1e-8
    losses = []
    
    for t in range(1, epochs+1):
        pred = X @ w + b
        loss = np.mean((pred - y)**2)
        losses.append(loss)
        
        dw = (2/m) * X.T @ (pred - y)
        db = (2/m) * np.sum(pred - y)
        
        m_w = beta1*m_w + (1-beta1)*dw
        v_w = beta2*v_w + (1-beta2)*dw*dw
        m_b = beta1*m_b + (1-beta1)*db
        v_b = beta2*v_b + (1-beta2)*db*db
        
        m_w_hat = m_w / (1-beta1**t)
        v_w_hat = v_w / (1-beta2**t)
        m_b_hat = m_b / (1-beta1**t)
        v_b_hat = v_b / (1-beta2**t)
        
        w -= lr * m_w_hat / (np.sqrt(v_w_hat) + eps)
        b -= lr * m_b_hat / (np.sqrt(v_b_hat) + eps)
    
    return w, b, losses

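# Adam moves each weight by roughly `lr` per step, so the defaults (0.01 x 200 epochs)
# cannot reach the coefficients of magnitude up to ~100 that make_regression generates.
# The values below are a starting point, not tuned settings.
w, b, losses = train_adam(X_train, y_train, lr=0.1, epochs=3000)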
pred = X_test @ w + b
print(f"Custom Adam R²: {r2_score(y_test, pred):.4f}")

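# sklearn comparison (least-squares baseline)
sk_model = LinearRegression()  # not named `lr`, to avoid confusion with the learning rate
sk_model.fit(X_train, y_train)
print(f"sklearn R²: {r2_score(y_test, sk_model.predict(X_test)):.4f}")
# coef_ has shape (1, n_features) because y is 2-D, so flatten both sides before comparing
print(f"Weight difference: {np.mean(np.abs(w.ravel() - sk_model.coef_.ravel())):.6f}")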
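`train_adam` returns the per-epoch `losses`, but the script above never inspects them. One way to use them is a simple plateau check that compares the mean loss over the last two windows of epochs. This is a minimal sketch: the helper name `loss_plateaued` and its `window` and `rel_tol` parameters are assumptions made for illustration, and the two loss curves are synthetic.

```python
import numpy as np

def loss_plateaued(losses, window=20, rel_tol=1e-4):
    """Return True if the mean loss changed by less than rel_tol (relative)
    between the last two windows of `window` epochs. Hypothetical helper."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = np.mean(losses[-window:])
    previous = np.mean(losses[-2 * window:-window])
    return bool(abs(previous - recent) <= rel_tol * max(abs(previous), 1e-12))

# Synthetic loss curves (assumed shapes, for illustration only).
still_improving = [100.0 * 0.99 ** t for t in range(200)]   # keeps shrinking
flat_tail = [100.0 * 0.9 ** t + 1.0 for t in range(200)]    # settles near 1.0

print(loss_plateaued(still_improving))
print(loss_plateaued(flat_tail))
```

If the check returns `False` on the real `losses`, raising `epochs` or `lr` is the first thing to try before comparing weights against sklearn.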

Summary

✅ Gradient Descent fundamentals
✅ Momentum / RMSProp / Adam
✅ SGD / Mini-Batch
✅ Automatic differentiation engine
✅ Linear regression from scratch

Chapter Summary

Understand the core concepts and theory
Master implementation methods and techniques
Learn common issues and their solutions
Apply knowledge to real-world projects

Implementation Examples

Basic Examples

# This section provides a complete implementation example
# to help you apply what you've learned to real projects

Steps

Initialization: Set up the development environment and required tools
Data Preparation: Collect and organize the required data
Core Implementation: Implement the main functionality and logic
Testing & Validation: Ensure the functionality works correctly
Optimization: Tune performance and user experience

Common Errors

| Error Type | Possible Cause | Solution |
|-----------|---------------|----------|
| Compilation Error | Syntax issues | Check code syntax |
| Runtime Error | Environment issues | Verify dependencies are installed |
| Logic Error | Algorithm issues | Step-by-step debugging and testing |
| Performance Issue | Efficiency issues | Use performance analysis tools |

Code Example

# Example code: a minimal script entry point

def main():
    # Main program logic
    print("Hello, World!")

if __name__ == "__main__":
    main()

Related Resources

Official documentation
API reference manuals
Open source project examples
Technical community discussions

Linear Regression with Gradient Descent

Why Gradient Descent for Linear Regression?

Linear regression has a closed-form solution: $\hat{\theta} = (X^T X)^{-1} X^T y$. But gradient descent is preferred when:

| Scenario | Closed-Form | Gradient Descent |
|----------|-------------|------------------|
| n_features > 10,000 | ❌ O(m·n² + n³) to form and solve the normal equations (m samples, n features) | ✅ O(m·n) per full-batch epoch |
| Streaming data | ❌ Must retrain from scratch | ✅ Online updates |
| Ridge/Lasso regularization | ✅ Ridge only; Lasso has no closed form | ✅ Works for both (Lasso via subgradient or proximal steps) |
| Feature count > sample count | ❌ $X^T X$ is singular | ✅ Still converges, though the solution is not unique |
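As a reference point for the closed-form column, the normal equation can be sketched in a few lines of NumPy. The helper name `fit_normal_equation`, the toy data, and the chosen coefficients are assumptions made for illustration. A column of ones is appended so the intercept is estimated together with the weights.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least squares via the normal equation (hypothetical helper)."""
    # Append a column of ones so the last coefficient acts as the intercept.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (Xb^T Xb) theta = Xb^T y without forming an explicit inverse.
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]

# Small noiseless toy problem (assumed values) so the exact answer is known.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0

w_cf, b_cf = fit_normal_equation(X_demo, y_demo)
print(w_cf, b_cf)
```

When $X^T X$ is singular or badly conditioned, for example with more features than samples, `np.linalg.solve` can fail or become numerically unreliable. `np.linalg.lstsq` is the usual alternative in that case.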

Mean Squared Error Loss

$$J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$

$$\frac{\partial J}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$
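A quick way to gain confidence in the gradient formula above is to compare it against central finite differences. The sketch below assumes the linear hypothesis $h_\theta(x) = w^T x + b$, with the bias treated as the parameter whose feature is always 1. The function names and the random test data are illustrative assumptions.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """Mean squared error J for the linear model X @ w + b."""
    return np.mean((X @ w + b - y) ** 2)

def mse_grad(X, y, w, b):
    """Analytic gradient of J with respect to w and b."""
    m = X.shape[0]
    err = X @ w + b - y
    return (2 / m) * (X.T @ err), (2 / m) * np.sum(err)

# Random test point (assumed values, for illustration only).
rng = np.random.default_rng(1)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.normal(size=(20, 1))
w_chk = rng.normal(size=(3, 1))
b_chk = 0.3

dw, db = mse_grad(X_chk, y_chk, w_chk, b_chk)

# Central finite differences, one parameter at a time.
h = 1e-6
dw_num = np.zeros_like(w_chk)
for j in range(w_chk.shape[0]):
    step = np.zeros_like(w_chk)
    step[j] = h
    dw_num[j] = (mse_loss(X_chk, y_chk, w_chk + step, b_chk)
                 - mse_loss(X_chk, y_chk, w_chk - step, b_chk)) / (2 * h)
db_num = (mse_loss(X_chk, y_chk, w_chk, b_chk + h)
          - mse_loss(X_chk, y_chk, w_chk, b_chk - h)) / (2 * h)

# Both differences should be close to zero if the formula is right.
print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

The same check can be pointed at the `dw` and `db` lines of a training loop whenever the loss function is changed, for instance after adding a regularization term.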

Complete Training Comparison

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

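# Train custom Adam (reuses train_adam, X_train, y_train, X_test, y_test from the first listing).
# train_adam returns three values: weights, bias, and the per-epoch loss history.
# lr and epochs are raised from the defaults so the weights have room to converge.
w, b, losses = train_adam(X_train, y_train, lr=0.1, epochs=3000)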
y_pred_custom = X_test @ w + b
r2_custom = r2_score(y_test, y_pred_custom)

# Train sklearn
sk_model = LinearRegression()
sk_model.fit(X_train, y_train)
y_pred_sklearn = sk_model.predict(X_test)
r2_sklearn = r2_score(y_test, y_pred_sklearn)

print(f"Custom Adam R²: {r2_custom:.4f}")
print(f"sklearn R²:     {r2_sklearn:.4f}")
print(f"Weights match:  {np.allclose(w.flatten(), sk_model.coef_.flatten(), atol=0.1)}")

Evaluation Metrics

| Metric | Formula | Range | Best |
|--------|---------|-------|------|
| R² (coefficient of determination) | $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | (-∞, 1] | 1.0 |
| MAE (Mean Absolute Error) | $\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$ | [0, ∞) | 0 |
| RMSE (Root Mean Squared Error) | $\sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}$ | [0, ∞) | 0 |
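The three formulas in the table translate directly into NumPy. The sketch below is a minimal illustration: the helper name `regression_metrics` and the four-point toy data are assumptions, and the function does not guard against a constant target, where the R² denominator is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and RMSE from the table's formulas (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

# Toy values (assumed) that are small enough to verify by hand.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 10.0])
metrics = regression_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

For this toy data the residuals are 0.5, -0.5, 0, and -1, which gives R² = 1 - 1.5/20 = 0.925, MAE = 0.5, and RMSE = sqrt(0.375) ≈ 0.612.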

Key Takeaways

Linear regression predicts continuous values from features
The closed-form solution (normal equation) works well for small feature sets
Gradient descent scales to very large feature counts because each epoch costs O(m·n)
Adam often converges in fewer iterations than plain gradient descent, provided the learning rate and epoch count fit the scale of the weights
Standardize features before training for stable convergence
R² measures the fraction of target variance the model explains
RMSE is in the same units as the target variable
Comparing a custom implementation against sklearn validates correctness

Next Chapter: Advanced Topics

This course continues with advanced optimization topics.