title: "Stochastic Gradient Descent (SGD)" description: "From batch GD to SGD and Mini-Batch, implement all three variants." order: 3
SGD vs Batch vs Mini-Batch
Vibe Prompt
"Compare Batch GD, SGD, Mini-Batch SGD on linear regression."
import numpy as np
def batch_gd(X, y, lr=0.01, epochs=100):
m, n = X.shape
w = np.zeros(n)
losses = []
for _ in range(epochs):
pred = X @ w
loss = np.mean((pred - y)**2)
losses.append(loss)
grad = (2/m) * X.T @ (pred - y)
w -= lr * grad
return w, losses
def sgd(X, y, lr=0.01, epochs=100):
m, n = X.shape
w = np.zeros(n)
losses = []
for _ in range(epochs):
idx = np.random.randint(m)
xi, yi = X[idx:idx+1], y[idx:idx+1]
pred = xi @ w
loss = np.mean((X @ w - y)**2)
grad = 2 * xi.T @ (pred - yi)
w -= lr * grad
losses.append(loss)
return w, losses
def minibatch_sgd(X, y, lr=0.01, epochs=100, batch_size=32):
m, n = X.shape
w = np.zeros(n)
losses = []
for _ in range(epochs):
idx = np.random.choice(m, batch_size, replace=False)
Xb, yb = X[idx], y[idx]
pred = Xb @ w
loss = np.mean((X @ w - y)**2)
grad = (2/len(idx)) * Xb.T @ (pred - yb)
w -= lr * grad
losses.append(loss)
return w, losses
# Test all three
np.random.seed(42)
X = np.random.randn(1000, 5)
true_w = np.array([3, -2, 1, 0.5, -1])
y = X @ true_w + np.random.randn(1000) * 0.1
w_b, _ = batch_gd(X, y, epochs=200)
w_s, _ = sgd(X, y, epochs=200)
w_m, _ = minibatch_sgd(X, y, epochs=200, batch_size=32)
print(f"True weights: {true_w}")
print(f"Batch GD: {np.round(w_b, 3)}")
print(f"SGD: {np.round(w_s, 3)}")
print(f"Mini-Batch:{np.round(w_m, 3)}")
Comparison
| Characteristic | Batch GD | SGD | Mini-Batch | |----------------|:--------:|:---:|:----------:| | Data per step | m (all) | 1 | 32-256 | | Cost per step | O(mn) | O(n) | O(bn) | | Stability | Most stable | Least stable | Medium | | Escape local minima | No | Yes | Yes | | Parallelization | High | Low | High | | Generalization | Worse (overfits) | Better | Better |
Learning Rate Scheduling
SGD needs decreasing LR during training:
def lr_decay(initial_lr, epoch, decay_type="step"):
if decay_type == "step":
return initial_lr * (0.1 ** (epoch // 30))
elif decay_type == "exp":
return initial_lr * (0.95 ** epoch)
elif decay_type == "cosine":
return 0.5 * initial_lr * (1 + np.cos(np.pi * epoch / 100))
Batch Size Impact
| Batch Size | Noise | Convergence Speed | Memory | |:----------:|:-----:|:----------------:|:-----:| | 1 (SGD) | High | Fast/step, slow overall | Low | | 32-64 | Medium | Best balance | Medium | | 128-256 | Low | Slow/step, fast overall | High | | 1024+ | Very low | May generalize poorly | Very high |
Practical Advice
- Default to Mini-Batch SGD with batch size 32-128
- SGD for small datasets (< 10K samples)
- Learning rate scheduling is critical for convergence
- Large batches (> 1024) need Linear Scaling Rule
- Shuffle data before each epoch to avoid order bias
Convergence Theory
SGD gradient has randomness. Analysis needed from expectation:
$$E[L(w_{t+1})] \leq L(w_t) - \eta |\nabla L(w_t)|^2 + \eta^2 \sigma^2$$
- $\eta |\nabla L|^2$: expected improvement term
- $\eta^2 \sigma^2$: noise term
Learning rate cannot be too large (noise dominates) or too small (improvement too slow).
Next Chapter: FFT / DTFT
From optimization to signal processing - using FFT for frequency analysis.
What Is Stochastic Gradient Descent?
SGD updates parameters using a single random training example (or a small mini-batch) instead of the full dataset.
Full Batch vs Mini-Batch vs SGD
| Method | Samples Per Update | Speed | Accuracy | Memory | |--------|-------------------|-------|----------|--------| | Full Batch GD | All samples | Slow per step | Most accurate gradient | High memory | | Mini-Batch SGD | 32-256 samples | Fast | Noisy but efficient | Low memory | | Stochastic SGD | 1 sample | Very fast | Very noisy gradient | Minimal |
Why Mini-Batch Works
import numpy as np
def mini_batch_sgd(X, y, lr=0.01, batch_size=32, epochs=100):
"""Mini-batch SGD for linear regression."""
n_samples, n_features = X.shape
theta = np.zeros(n_features)
for epoch in range(epochs):
# Shuffle data
indices = np.random.permutation(n_samples)
X_shuffled = X[indices]
y_shuffled = y[indices]
# Process in mini-batches
for start in range(0, n_samples, batch_size):
end = min(start + batch_size, n_samples)
X_batch = X_shuffled[start:end]
y_batch = y_shuffled[start:end]
# Compute gradient on mini-batch
predictions = X_batch @ theta
errors = predictions - y_batch
gradient = (X_batch.T @ errors) / len(X_batch)
# Update
theta -= lr * gradient
# Compute full loss every epoch
if epoch % 10 == 0:
predictions = X @ theta
loss = np.mean((predictions - y) ** 2)
print(f"Epoch {epoch}: loss = {loss:.4f}")
return theta
Batch Size Impact
| Batch Size | Update Noise | Training Time | Generalization | |-----------|-------------|---------------|----------------| | 1 (pure SGD) | Very high | Fast steps, many total | May generalize better | | 32 | Moderate | Good balance | Good | | 128 | Low | Fewer steps | Standard choice | | Full dataset | None | Slow per step | May overfit |
Learning Rate for SGD
SGD needs a different learning rate strategy than full batch GD.
def sgd_with_momentum(theta, gradient, lr=0.01, momentum=0.9):
"""SGD with momentum for smoother updates."""
velocity = np.zeros_like(theta)
for step in range(1000):
grad = gradient(theta)
velocity = momentum * velocity + lr * grad
theta -= velocity
return theta
Summary
SGD and mini-batch SGD make gradient descent practical for large datasets. Mini-batch strikes the best balance between speed and accuracy.
Key takeaways: | Batch GD: uses all data โ accurate but slow and high memory | | Mini-batch SGD: uses 32-256 samples โ best balance for most tasks | | Pure SGD: uses 1 sample โ noisy but escapes local minima easily | | Shuffle data before each epoch to prevent biased gradients | | Smaller batches add noise that can improve generalization | | Mini-batch with momentum smooths the noisy gradient path | | Batch size 32 is the default for most deep learning frameworks |
Next Chapter: FFT / DTFT
The next chapter covers frequency analysis using FFT.