title: "Stochastic Gradient Descent (SGD)" description: "From batch GD to SGD and Mini-Batch, implement all three variants." order: 3

SGD vs Batch vs Mini-Batch

Vibe Prompt

"Compare Batch GD, SGD, Mini-Batch SGD on linear regression."

import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    w = np.zeros(n)
    losses = []
    for _ in range(epochs):
        pred = X @ w
        loss = np.mean((pred - y)**2)
        losses.append(loss)
        grad = (2/m) * X.T @ (pred - y)
        w -= lr * grad
    return w, losses

def sgd(X, y, lr=0.01, epochs=100):
    m, n = X.shape
    w = np.zeros(n)
    losses = []
    for _ in range(epochs):
        idx = np.random.randint(m)
        xi, yi = X[idx:idx+1], y[idx:idx+1]
        pred = xi @ w
        loss = np.mean((X @ w - y)**2)
        grad = 2 * xi.T @ (pred - yi)
        w -= lr * grad
        losses.append(loss)
    return w, losses

def minibatch_sgd(X, y, lr=0.01, epochs=100, batch_size=32):
    m, n = X.shape
    w = np.zeros(n)
    losses = []
    for _ in range(epochs):
        idx = np.random.choice(m, batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        pred = Xb @ w
        loss = np.mean((X @ w - y)**2)
        grad = (2/len(idx)) * Xb.T @ (pred - yb)
        w -= lr * grad
        losses.append(loss)
    return w, losses

# Test all three
np.random.seed(42)
X = np.random.randn(1000, 5)
true_w = np.array([3, -2, 1, 0.5, -1])
y = X @ true_w + np.random.randn(1000) * 0.1

w_b, _ = batch_gd(X, y, epochs=200)
w_s, _ = sgd(X, y, epochs=200)
w_m, _ = minibatch_sgd(X, y, epochs=200, batch_size=32)

print(f"True weights: {true_w}")
print(f"Batch GD: {np.round(w_b, 3)}")
print(f"SGD:      {np.round(w_s, 3)}")
print(f"Mini-Batch:{np.round(w_m, 3)}")

Comparison

| Characteristic | Batch GD | SGD | Mini-Batch | |----------------|:--------:|:---:|:----------:| | Data per step | m (all) | 1 | 32-256 | | Cost per step | O(mn) | O(n) | O(bn) | | Stability | Most stable | Least stable | Medium | | Escape local minima | No | Yes | Yes | | Parallelization | High | Low | High | | Generalization | Worse (overfits) | Better | Better |

Learning Rate Scheduling

SGD needs decreasing LR during training:

def lr_decay(initial_lr, epoch, decay_type="step"):
    if decay_type == "step":
        return initial_lr * (0.1 ** (epoch // 30))
    elif decay_type == "exp":
        return initial_lr * (0.95 ** epoch)
    elif decay_type == "cosine":
        return 0.5 * initial_lr * (1 + np.cos(np.pi * epoch / 100))

Batch Size Impact

| Batch Size | Noise | Convergence Speed | Memory | |:----------:|:-----:|:----------------:|:-----:| | 1 (SGD) | High | Fast/step, slow overall | Low | | 32-64 | Medium | Best balance | Medium | | 128-256 | Low | Slow/step, fast overall | High | | 1024+ | Very low | May generalize poorly | Very high |


Practical Advice

  1. Default to Mini-Batch SGD with batch size 32-128
  2. SGD for small datasets (< 10K samples)
  3. Learning rate scheduling is critical for convergence
  4. Large batches (> 1024) need Linear Scaling Rule
  5. Shuffle data before each epoch to avoid order bias

Convergence Theory

SGD gradient has randomness. Analysis needed from expectation:

$$E[L(w_{t+1})] \leq L(w_t) - \eta |\nabla L(w_t)|^2 + \eta^2 \sigma^2$$

  • $\eta |\nabla L|^2$: expected improvement term
  • $\eta^2 \sigma^2$: noise term

Learning rate cannot be too large (noise dominates) or too small (improvement too slow).

Next Chapter: FFT / DTFT

From optimization to signal processing - using FFT for frequency analysis.

What Is Stochastic Gradient Descent?

SGD updates parameters using a single random training example (or a small mini-batch) instead of the full dataset.

Full Batch vs Mini-Batch vs SGD

| Method | Samples Per Update | Speed | Accuracy | Memory | |--------|-------------------|-------|----------|--------| | Full Batch GD | All samples | Slow per step | Most accurate gradient | High memory | | Mini-Batch SGD | 32-256 samples | Fast | Noisy but efficient | Low memory | | Stochastic SGD | 1 sample | Very fast | Very noisy gradient | Minimal |

Why Mini-Batch Works

import numpy as np

def mini_batch_sgd(X, y, lr=0.01, batch_size=32, epochs=100):
    """Mini-batch SGD for linear regression."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Process in mini-batches
        for start in range(0, n_samples, batch_size):
            end = min(start + batch_size, n_samples)
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            
            # Compute gradient on mini-batch
            predictions = X_batch @ theta
            errors = predictions - y_batch
            gradient = (X_batch.T @ errors) / len(X_batch)
            
            # Update
            theta -= lr * gradient
        
        # Compute full loss every epoch
        if epoch % 10 == 0:
            predictions = X @ theta
            loss = np.mean((predictions - y) ** 2)
            print(f"Epoch {epoch}: loss = {loss:.4f}")
    
    return theta

Batch Size Impact

| Batch Size | Update Noise | Training Time | Generalization | |-----------|-------------|---------------|----------------| | 1 (pure SGD) | Very high | Fast steps, many total | May generalize better | | 32 | Moderate | Good balance | Good | | 128 | Low | Fewer steps | Standard choice | | Full dataset | None | Slow per step | May overfit |

Learning Rate for SGD

SGD needs a different learning rate strategy than full batch GD.

def sgd_with_momentum(theta, gradient, lr=0.01, momentum=0.9):
    """SGD with momentum for smoother updates."""
    velocity = np.zeros_like(theta)
    
    for step in range(1000):
        grad = gradient(theta)
        velocity = momentum * velocity + lr * grad
        theta -= velocity
    
    return theta

Summary

SGD and mini-batch SGD make gradient descent practical for large datasets. Mini-batch strikes the best balance between speed and accuracy.

Key takeaways: | Batch GD: uses all data โ€” accurate but slow and high memory | | Mini-batch SGD: uses 32-256 samples โ€” best balance for most tasks | | Pure SGD: uses 1 sample โ€” noisy but escapes local minima easily | | Shuffle data before each epoch to prevent biased gradients | | Smaller batches add noise that can improve generalization | | Mini-batch with momentum smooths the noisy gradient path | | Batch size 32 is the default for most deep learning frameworks |

Next Chapter: FFT / DTFT

The next chapter covers frequency analysis using FFT.

Unlock Full Tutorial

This chapter is paid content. Join the project to unlock over 5000 words of deep analysis, including 10+ god-tier Prompts and real Source Code examples!