Linear Regression from Scratch
Vibe Prompt
"Train linear regression from scratch using Adam Optimizer, and compare your custom Adam implementation with sklearn's results."
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# Generate synthetic data (fixed seeds so the run is reproducible)
X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features using training-set statistics only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Custom Adam training
def train_adam(X, y, lr=0.01, epochs=200):
    m, n = X.shape                                   # m samples, n features
    w = np.zeros((n, 1))
    b = 0.0
    m_w, v_w = np.zeros((n, 1)), np.zeros((n, 1))    # moment estimates for w
    m_b, v_b = 0.0, 0.0                              # moment estimates for b
    beta1, beta2 = 0.9, 0.999
    eps = 1e-8
    losses = []
    for t in range(1, epochs + 1):
        # Forward pass and MSE loss
        pred = X @ w + b
        loss = np.mean((pred - y) ** 2)
        losses.append(loss)
        # Gradients of the MSE loss
        dw = (2 / m) * X.T @ (pred - y)
        db = (2 / m) * np.sum(pred - y)
        # Exponential moving averages of the gradient and its square
        m_w = beta1 * m_w + (1 - beta1) * dw
        v_w = beta2 * v_w + (1 - beta2) * dw * dw
        m_b = beta1 * m_b + (1 - beta1) * db
        v_b = beta2 * v_b + (1 - beta2) * db * db
        # Bias correction for the zero-initialized averages
        m_w_hat = m_w / (1 - beta1 ** t)
        v_w_hat = v_w / (1 - beta2 ** t)
        m_b_hat = m_b / (1 - beta1 ** t)
        v_b_hat = v_b / (1 - beta2 ** t)
        # Parameter update
        w -= lr * m_w_hat / (np.sqrt(v_w_hat) + eps)
        b -= lr * m_b_hat / (np.sqrt(v_b_hat) + eps)
    return w, b, losses

# Each Adam step moves a weight by roughly lr, and make_regression draws
# coefficients up to 100, so the defaults (lr=0.01, epochs=200) stop far
# short of the least-squares weights. lr=0.1 with 5000 epochs is one setting
# that should get there; tune as needed.
w, b, losses = train_adam(X_train, y_train, lr=0.1, epochs=5000)
pred = X_test @ w + b
print(f"Custom Adam R²: {r2_score(y_test, pred):.4f}")

# sklearn comparison (named sk_model to avoid confusion with the learning rate lr)
sk_model = LinearRegression()
sk_model.fit(X_train, y_train)
print(f"sklearn R²: {r2_score(y_test, sk_model.predict(X_test)):.4f}")
print(f"Weight difference: {np.mean(np.abs(w.ravel() - sk_model.coef_.ravel())):.6f}")
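One property of the Adam update is easy to check by hand: at t = 1 the bias-corrected moments reduce to the gradient and its square, so the first step has magnitude almost exactly `lr`, whatever the gradient's scale. The sketch below verifies this for a few gradient sizes. `adam_first_step` is a hypothetical helper written for this illustration, not part of the training code.

```python
import numpy as np

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter change produced by the very first Adam update (t = 1)."""
    m = (1 - beta1) * grad            # first moment, starting from zero
    v = (1 - beta2) * grad * grad     # second moment, starting from zero
    m_hat = m / (1 - beta1 ** 1)      # bias correction recovers grad
    v_hat = v / (1 - beta2 ** 1)      # bias correction recovers grad**2
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

grads = np.array([-250.0, -3.0, 0.5, 4000.0])
steps = adam_first_step(grads)
print(steps)  # each entry is close to -lr * sign(grad)
```

A practical consequence is that the learning rate and the number of epochs together bound how far a weight can travel. With `lr=0.01`, 200 updates typically move a weight by only about 2 in total, so targets with large coefficients need a larger learning rate, more epochs, or a rescaled target.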
Summary
- ✅ Gradient Descent fundamentals
- ✅ Momentum / RMSProp / Adam
- ✅ SGD / Mini-Batch
- ✅ Automatic differentiation engine
- ✅ Linear regression from scratch
Linear Regression with Gradient Descent
Why Gradient Descent for Linear Regression?
Linear regression has a closed-form solution, the normal equation: $\hat{\theta} = (X^T X)^{-1} X^T y$. Gradient descent is still the better choice in several situations (below, $m$ is the number of samples and $n$ the number of features):

| Scenario | Closed-Form | Gradient Descent |
|----------|-------------|------------------|
| n_features > 10,000 | ❌ O(n³) to solve, plus O(mn²) to form $X^T X$ | ✅ O(mn) per epoch |
| Streaming data | ❌ Must retrain from scratch | ✅ Online updates |
| Ridge/Lasso regularization | ✅ Ridge is still closed-form; Lasso is not | ✅ Works for Ridge; Lasso needs a subgradient or proximal step |
| Feature count > sample count | ❌ $X^T X$ is singular | ✅ Runs, though the minimizer is not unique |
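For reference, the closed-form solution takes only a few lines of NumPy. This is a minimal sketch on synthetic data invented for the example (the names `X_demo`, `y_demo`, and `true_theta` are illustrative). It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_theta = np.array([4.0, 2.0, -1.0, 0.5])   # [intercept, w1, w2, w3]

# Prepend a column of ones so the intercept is part of theta
X_aug = np.hstack([np.ones((200, 1)), X_demo])
y_demo = X_aug @ true_theta + rng.normal(scale=0.1, size=200)

# Normal equations: (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

# Least-squares solver, which also copes with rank-deficient X
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(theta_normal)
print(theta_lstsq)
```

Both routes should agree closely on a well-conditioned problem like this one. When $X^T X$ is singular or nearly so, `np.linalg.solve` can fail or lose accuracy, which is one reason to prefer `lstsq` or an iterative method.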
Mean Squared Error Loss
$$J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$
$$\frac{\partial J}{\partial \theta_j} = \frac{2}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$
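A quick way to gain confidence in the gradient formula is to compare it with central finite differences. The sketch below assumes the linear hypothesis $h_\theta(x) = \theta^T x$ with the intercept omitted for brevity. The helper names (`mse_loss`, `mse_grad`, `numerical_grad`) and the random test data are made up for this check.

```python
import numpy as np

def mse_loss(theta, X, y):
    """J(theta) = (1/m) * sum((X @ theta - y)**2)."""
    residual = X @ theta - y
    return np.mean(residual ** 2)

def mse_grad(theta, X, y):
    """Analytic gradient: (2/m) * X^T (X @ theta - y)."""
    m = X.shape[0]
    return (2.0 / m) * X.T @ (X @ theta - y)

def numerical_grad(f, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
X_chk = rng.normal(size=(50, 4))
y_chk = rng.normal(size=50)
theta_chk = rng.normal(size=4)

analytic = mse_grad(theta_chk, X_chk, y_chk)
numeric = numerical_grad(lambda th: mse_loss(th, X_chk, y_chk), theta_chk)
max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)  # should be tiny; only floating-point rounding separates the two
```

Because the loss is quadratic in $\theta$, central differences are exact up to rounding, so any sizeable gap would point to a mistake in the analytic gradient.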
Complete Training Comparison
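import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Assumes train_adam, X_train, X_test, y_train, and y_test from the first listing are in scope

# Train custom Adam. train_adam returns the loss history as a third value.
# lr and epochs are raised from the defaults so Adam has enough steps to
# approach the least-squares weights; tune as needed.
w, b, losses = train_adam(X_train, y_train, lr=0.1, epochs=5000)
y_pred_custom = X_test @ w + b
r2_custom = r2_score(y_test, y_pred_custom)

# Train sklearn
sk_model = LinearRegression()
sk_model.fit(X_train, y_train)
y_pred_sklearn = sk_model.predict(X_test)
r2_sklearn = r2_score(y_test, y_pred_sklearn)

print(f"Custom Adam R²: {r2_custom:.4f}")
print(f"sklearn R²: {r2_sklearn:.4f}")
print(f"Weights match: {np.allclose(w.flatten(), sk_model.coef_.flatten(), atol=0.1)}")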
Evaluation Metrics
| Metric | Formula | Range | Best |
|--------|---------|-------|------|
| R² (coefficient of determination) | $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | (-∞, 1] | 1.0 |
| MAE (Mean Absolute Error) | $\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$ | [0, ∞) | 0 |
| RMSE (Root Mean Squared Error) | $\sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}$ | [0, ∞) | 0 |
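The three formulas translate directly into NumPy. The sketch below uses a hypothetical helper, `regression_metrics`, and a four-point toy example chosen so the values can be checked by hand; it is meant as an illustration of the formulas, not a replacement for `sklearn.metrics`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and RMSE from the formulas in the table."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residual)),
        "rmse": np.sqrt(np.mean(residual ** 2)),
    }

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])
metrics = regression_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

For this toy data the residuals are 0.5, -0.5, 0, and -1, so MAE is 0.5 and RMSE is the square root of 0.375. MAE and RMSE are both in the units of the target, while R² is unitless.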
Key Takeaways
- Linear regression predicts continuous values from features
- The closed-form solution (normal equation) works well for small feature sets
- Gradient descent scales to very large feature counts, where solving the normal equation becomes too expensive
- Adam often converges in fewer iterations than plain gradient descent, though its learning rate still has to suit the scale of the weights
- Standardize features before training for stable convergence
- R² measures the fraction of the target's variance that the model explains
- RMSE is in the same units as the target variable
- Comparing a custom implementation against sklearn is a useful correctness check
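The point about standardization can be made concrete through the condition number of $X^T X / m$, which is the Hessian of the MSE loss up to a factor of 2 and governs how quickly plain gradient descent can converge. The sketch below uses made-up data with two features on very different scales; `standardize` is an illustrative helper that mirrors the train-statistics-only scaling used earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two features on very different scales (illustrative data)
X_raw = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=500),
    rng.normal(loc=50.0, scale=1000.0, size=500),
])

def standardize(X_fit, X_apply):
    """Scale X_apply using the mean and std computed on X_fit only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    return (X_apply - mean) / std

X_std = standardize(X_raw, X_raw)

# Condition number of X^T X / m before and after scaling
cond_raw = np.linalg.cond(X_raw.T @ X_raw / len(X_raw))
cond_std = np.linalg.cond(X_std.T @ X_std / len(X_std))
print(f"condition number before: {cond_raw:.1f}, after: {cond_std:.3f}")
```

A condition number near 1 means the loss surface is close to round, so a single learning rate suits every direction. A very large one forces plain gradient descent to use a tiny step to stay stable along the steepest direction.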
Next Chapter: Advanced Topics
This course continues with advanced optimization topics.