title: "Gradient Descent Fundamentals" description: "Understanding gradient descent from calculus, implement from scratch." order: 1

Gradient Descent Fundamentals

Intuitive Understanding

Imagine standing on a mountain blindfolded, trying to reach the valley. You feel which direction is downhill, take a step, and repeat.

  1. Feel the steepest downhill direction (negative gradient)
  2. Take a step of size (learning rate)
  3. Repeat until no more downhill (convergence)

$$w_{t+1} = w_t - \eta \cdot \nabla L(w_t)$$

$\eta$ is learning rate. $\nabla L(w_t)$ is gradient of loss function at $w_t$.

Implementation from Scratch

import numpy as np

def gradient_descent(gradient_func, initial_w, lr=0.1, n_steps=100):
    w = initial_w
    history = [w]
    for i in range(n_steps):
        grad = gradient_func(w)
        w = w - lr * grad
        history.append(w.copy())
    return w, history

Gradient Computation Methods

| Method | Formula | Characteristics | |--------|---------|-----------------| | Numerical | df/dx approx (f(x+h)-f(x-h))/(2h) | Simple, slow, good for validation | | Analytical | Manual formula derivation | Precise, fast, needs derivation | | Automatic | Computational graph chain rule | Mainstream in ML frameworks |

Learning Rate Impact

Learning rate is the most important hyperparameter:

  • Too small ($\eta = 0.001$): Converges extremely slowly
  • Moderate ($\eta = 0.1$): Smooth convergence, typical choice
  • Too large ($\eta = 1.0$): May oscillate or even diverge
def f(x, y): return x**2 + 2*y**2
def grad_f(x, y): return np.array([2*x, 4*y])

def gd_2d(start, lr=0.1, steps=50):
    path = [np.array(start)]
    w = np.array(start)
    for _ in range(steps):
        w = w - lr * grad_f(*w)
        path.append(w.copy())
    return np.array(path)

# Compare learning rates
for lr in [0.01, 0.1, 0.5]:
    path = gd_2d((3, 4), lr=lr, steps=50)
    print(f"lr={lr}: end point = {path[-1]}")

Convergence Criteria

When to stop?

  1. Gradient near zero: $|\nabla L(w)| < \epsilon$ (precise but expensive)
  2. Loss stops decreasing: $|L_{t+1} - L_t| < \epsilon$ (most practical)
  3. Fixed iterations: Reached max iterations (simple)

Three Variants of Gradient Descent

| Variant | Data per Step | Pros | Cons | |---------|:-------------:|------|------| | Batch GD | All data | Stable direction | Very slow on large data | | SGD | 1 data point | Fast, escapes local minima | Noisy direction | | Mini-Batch | 32-256 points | Balance speed and stability | Need to tune batch size |


Key Takeaways

  • Gradient descent iteratively updates parameters along negative gradient
  • Learning rate determines convergence speed and stability
  • Convex functions guarantee convergence to global minimum
  • Narrow valleys (high condition number) cause oscillation, needs Momentum

Learning Rate Scheduling

| Strategy | Method | Best For | |----------|--------|----------| | Step Decay | Halve every N epochs | CV training | | Exponential Decay | $\eta = \eta_0 \cdot e^{-kt}$ | Smooth decay | | Cosine Annealing | Cosine periodic adjustment | With restart | | Reduce on Plateau | Reduce when loss plateaus | When unsure |

Next Chapter: Momentum & Adam

Learning rate scheduling solves "how big each step." Next chapter solves "which direction" using inertia and adaptive learning rates.

Why Gradient Descent?

Many problems have no closed-form solution. Gradient descent iteratively finds the minimum of a function by following the negative gradient.

When to Use Gradient Descent

| Scenario | Closed-Form Solution | Gradient Descent | |----------|---------------------|------------------| | Linear regression with few features | ✅ Direct formula | ✅ Works too | | Linear regression with 10,000+ features | ❌ Matrix inversion too slow | ✅ Fast | | Neural networks | ❌ No closed form | ✅ Required | | Logistic regression | ❌ No closed form | ✅ Required |

The Algorithm

$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$

Where:

  • $\theta_t$ = parameters at step t
  • $\eta$ = learning rate (step size)
  • $\nabla J(\theta_t)$ = gradient of the cost function

Learning Rate Impact

| Learning Rate | Behavior | Result | |---------------|----------|--------| | Too small (0.0001) | Tiny steps | Slow convergence, many iterations | | Good (0.01) | Steady descent | Reaches minimum efficiently | | Too large (1.0) | Big jumps | Overshoots, may diverge |

Python Implementation

import numpy as np

def gradient_descent(gradient_func, start, lr=0.01, epochs=1000, tol=1e-6):
    """Generic gradient descent optimizer."""
    theta = np.array(start, dtype=float)
    history = [theta.copy()]
    
    for epoch in range(epochs):
        grad = gradient_func(theta)
        theta_new = theta - lr * grad
        history.append(theta_new.copy())
        
        # Check convergence
        if np.linalg.norm(theta_new - theta) < tol:
            print(f"Converged at epoch {epoch}")
            break
        
        theta = theta_new
    
    return theta, history

# Example: Minimize f(x) = x^2 + 2x + 1 (minimum at x = -1)
def gradient(x):
    return 2 * x + 2  # Derivative of x^2 + 2x + 1

result, _ = gradient_descent(gradient, start=[5.0], lr=0.1)
print(f"Minimum found at x = {result[0]:.4f}")
print(f"Expected: x = -1.0")

Learning Rate Schedules

| Schedule | Formula | Behavior | |----------|---------|----------| | Constant | $\eta_t = \eta_0$ | Simple, may oscillate | | Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t / step \rfloor}$ | Drops at intervals | | Exponential | $\eta_t = \eta_0 \cdot e^{-kt}$ | Smooth decay | | Cosine | $\eta_t = \frac{1}{2}\eta_0(1 + \cos(\frac{t\pi}{T}))$ | Cyclical warmup/cool-down |

def learning_rate_schedule(epoch, lr0=0.1, decay=0.1, step_size=100):
    """Step decay schedule."""
    return lr0 * (decay ** (epoch // step_size))

# Usage in training loop
for epoch in range(1000):
    lr = learning_rate_schedule(epoch, lr0=0.1)
    theta -= lr * gradient(theta)

Summary

Gradient descent iteratively finds function minima by following the negative gradient. Learning rate selection and scheduling are critical for convergence.

Key takeaways: | Gradient descent: follow the negative gradient to reach the minimum | | Learning rate: too small = slow, too large = diverges | | Start with lr=0.01, adjust based on loss curve behavior | | Learning rate schedules: step decay, exponential, cosine | | Monitor convergence: stop when parameter change < tolerance | | Gradient descent scales to millions of parameters | | Required for neural networks — no closed-form solution exists |

Next Chapter: Momentum and Adam

The next chapter covers momentum and the Adam optimizer.

Member Exclusive Free Tutorial

This chapter is free exclusive content for registered members! Please login or register to unlock immediately.

Login / Register Now