title: "Gradient Descent Fundamentals" description: "Understanding gradient descent from calculus, implement from scratch." order: 1
Gradient Descent Fundamentals
Intuitive Understanding
Imagine standing on a mountain blindfolded, trying to reach the valley. You feel which direction is downhill, take a step, and repeat.
- Feel the steepest downhill direction (negative gradient)
- Take a step of size (learning rate)
- Repeat until no more downhill (convergence)
$$w_{t+1} = w_t - \eta \cdot \nabla L(w_t)$$
$\eta$ is learning rate. $\nabla L(w_t)$ is gradient of loss function at $w_t$.
Implementation from Scratch
import numpy as np
def gradient_descent(gradient_func, initial_w, lr=0.1, n_steps=100):
w = initial_w
history = [w]
for i in range(n_steps):
grad = gradient_func(w)
w = w - lr * grad
history.append(w.copy())
return w, history
Gradient Computation Methods
| Method | Formula | Characteristics | |--------|---------|-----------------| | Numerical | df/dx approx (f(x+h)-f(x-h))/(2h) | Simple, slow, good for validation | | Analytical | Manual formula derivation | Precise, fast, needs derivation | | Automatic | Computational graph chain rule | Mainstream in ML frameworks |
Learning Rate Impact
Learning rate is the most important hyperparameter:
- Too small ($\eta = 0.001$): Converges extremely slowly
- Moderate ($\eta = 0.1$): Smooth convergence, typical choice
- Too large ($\eta = 1.0$): May oscillate or even diverge
def f(x, y): return x**2 + 2*y**2
def grad_f(x, y): return np.array([2*x, 4*y])
def gd_2d(start, lr=0.1, steps=50):
path = [np.array(start)]
w = np.array(start)
for _ in range(steps):
w = w - lr * grad_f(*w)
path.append(w.copy())
return np.array(path)
# Compare learning rates
for lr in [0.01, 0.1, 0.5]:
path = gd_2d((3, 4), lr=lr, steps=50)
print(f"lr={lr}: end point = {path[-1]}")
Convergence Criteria
When to stop?
- Gradient near zero: $|\nabla L(w)| < \epsilon$ (precise but expensive)
- Loss stops decreasing: $|L_{t+1} - L_t| < \epsilon$ (most practical)
- Fixed iterations: Reached max iterations (simple)
Three Variants of Gradient Descent
| Variant | Data per Step | Pros | Cons | |---------|:-------------:|------|------| | Batch GD | All data | Stable direction | Very slow on large data | | SGD | 1 data point | Fast, escapes local minima | Noisy direction | | Mini-Batch | 32-256 points | Balance speed and stability | Need to tune batch size |
Key Takeaways
- Gradient descent iteratively updates parameters along negative gradient
- Learning rate determines convergence speed and stability
- Convex functions guarantee convergence to global minimum
- Narrow valleys (high condition number) cause oscillation, needs Momentum
Learning Rate Scheduling
| Strategy | Method | Best For | |----------|--------|----------| | Step Decay | Halve every N epochs | CV training | | Exponential Decay | $\eta = \eta_0 \cdot e^{-kt}$ | Smooth decay | | Cosine Annealing | Cosine periodic adjustment | With restart | | Reduce on Plateau | Reduce when loss plateaus | When unsure |
Next Chapter: Momentum & Adam
Learning rate scheduling solves "how big each step." Next chapter solves "which direction" using inertia and adaptive learning rates.
Why Gradient Descent?
Many problems have no closed-form solution. Gradient descent iteratively finds the minimum of a function by following the negative gradient.
When to Use Gradient Descent
| Scenario | Closed-Form Solution | Gradient Descent | |----------|---------------------|------------------| | Linear regression with few features | ✅ Direct formula | ✅ Works too | | Linear regression with 10,000+ features | ❌ Matrix inversion too slow | ✅ Fast | | Neural networks | ❌ No closed form | ✅ Required | | Logistic regression | ❌ No closed form | ✅ Required |
The Algorithm
$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$
Where:
- $\theta_t$ = parameters at step t
- $\eta$ = learning rate (step size)
- $\nabla J(\theta_t)$ = gradient of the cost function
Learning Rate Impact
| Learning Rate | Behavior | Result | |---------------|----------|--------| | Too small (0.0001) | Tiny steps | Slow convergence, many iterations | | Good (0.01) | Steady descent | Reaches minimum efficiently | | Too large (1.0) | Big jumps | Overshoots, may diverge |
Python Implementation
import numpy as np
def gradient_descent(gradient_func, start, lr=0.01, epochs=1000, tol=1e-6):
"""Generic gradient descent optimizer."""
theta = np.array(start, dtype=float)
history = [theta.copy()]
for epoch in range(epochs):
grad = gradient_func(theta)
theta_new = theta - lr * grad
history.append(theta_new.copy())
# Check convergence
if np.linalg.norm(theta_new - theta) < tol:
print(f"Converged at epoch {epoch}")
break
theta = theta_new
return theta, history
# Example: Minimize f(x) = x^2 + 2x + 1 (minimum at x = -1)
def gradient(x):
return 2 * x + 2 # Derivative of x^2 + 2x + 1
result, _ = gradient_descent(gradient, start=[5.0], lr=0.1)
print(f"Minimum found at x = {result[0]:.4f}")
print(f"Expected: x = -1.0")
Learning Rate Schedules
| Schedule | Formula | Behavior | |----------|---------|----------| | Constant | $\eta_t = \eta_0$ | Simple, may oscillate | | Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / step \rfloor}$ | Drops at intervals | | Exponential | $\eta_t = \eta_0 \cdot e^{-kt}$ | Smooth decay | | Cosine | $\eta_t = \frac{1}{2}\eta_0(1 + \cos(\frac{t\pi}{T}))$ | Cyclical warmup/cool-down |
def learning_rate_schedule(epoch, lr0=0.1, decay=0.1, step_size=100):
"""Step decay schedule."""
return lr0 * (decay ** (epoch // step_size))
# Usage in training loop
for epoch in range(1000):
lr = learning_rate_schedule(epoch, lr0=0.1)
theta -= lr * gradient(theta)
Summary
Gradient descent iteratively finds function minima by following the negative gradient. Learning rate selection and scheduling are critical for convergence.
Key takeaways: | Gradient descent: follow the negative gradient to reach the minimum | | Learning rate: too small = slow, too large = diverges | | Start with lr=0.01, adjust based on loss curve behavior | | Learning rate schedules: step decay, exponential, cosine | | Monitor convergence: stop when parameter change < tolerance | | Gradient descent scales to millions of parameters | | Required for neural networks — no closed-form solution exists |
Next Chapter: Momentum and Adam
The next chapter covers momentum and the Adam optimizer.