title: "Momentum & Adam Optimizer"
description: "Implement Momentum, RMSProp, and Adam, compare convergence speeds."
order: 2
Momentum & Adam
Momentum
Traditional GD oscillates in narrow valleys. Momentum adds inertia:
$$v_{t+1} = \beta v_t + \nabla L(w_t)$$
$$w_{t+1} = w_t - \eta v_{t+1}$$
$\beta$ (usually 0.9) controls inertia. Higher $\beta$ = smoother path but may overshoot.
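The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `momentum_gd`, the quadratic test problem, and the values `lr=0.01` and `steps=200` are assumptions chosen for the example. Setting `beta=0` reduces the same loop to plain gradient descent, which makes a side-by-side comparison easy.

```python
import numpy as np

def momentum_gd(grad, w0, lr=0.01, beta=0.9, steps=200):
    """Gradient descent with momentum: v <- beta*v + grad(w), w <- w - lr*v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

# Hypothetical test problem: L(w) = 0.5 * (w[0]**2 + 50 * w[1]**2),
# a narrow valley whose minimum is at the origin.
scales = np.array([1.0, 50.0])
grad = lambda w: scales * w

w_plain = momentum_gd(grad, [5.0, 1.0], beta=0.0)  # beta=0 is plain GD
w_mom = momentum_gd(grad, [5.0, 1.0], beta=0.9)

# Distance from the minimum after the same number of steps
print(np.linalg.norm(w_plain), np.linalg.norm(w_mom))
```

On this particular problem the momentum run should end closer to the origin than the plain run, because the flat direction is the bottleneck for plain GD at this learning rate. Other problems and learning rates can behave differently.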
RMSProp
Different parameters have different gradient scales. RMSProp maintains per-parameter learning rates:
$$s_{t+1} = \beta_2 s_t + (1-\beta_2)(\nabla L(w_t))^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla L(w_t)$$
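- Large gradients: $s$ grows, so the effective learning rate $\eta / (\sqrt{s} + \epsilon)$ shrinks automatically
- Small gradients: $s$ stays small, so the effective learning rate grows automatically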
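A minimal sketch of the RMSProp update, assuming `beta2=0.9` and `lr=0.01` (the text above does not fix these values) and a hypothetical quadratic whose two coordinates have gradients that differ by a factor of 100. After one step from $s_0 = 0$, the update reduces to $\eta / \sqrt{1-\beta_2}$ times the sign of the gradient, so both coordinates move by about the same amount despite the very different gradient scales.

```python
import numpy as np

def rmsprop(grad, w0, lr=0.01, beta2=0.9, eps=1e-8, steps=500):
    """RMSProp: scale each coordinate's step by a running RMS of its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = beta2 * s + (1 - beta2) * g * g
        w = w - lr / (np.sqrt(s) + eps) * g
    return w

# Hypothetical problem: gradients of the two coordinates differ by 100x.
scales = np.array([1.0, 100.0])
grad = lambda w: scales * w
w0 = np.array([1.0, 1.0])

first_step = w0 - rmsprop(grad, w0, steps=1)
w_final = rmsprop(grad, w0)

print("first step per coordinate:", first_step)
print("after 500 steps:", w_final)
```

With a constant learning rate the iterates do not settle exactly at the minimum; they stay within a small band around it whose width is set by `lr`.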
Adam = Momentum + RMSProp
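Adam is one of the most commonly used optimizers. It combines momentum (a running mean of gradients) with RMSProp-style adaptive learning rates (a running mean of squared gradients), and adds a bias correction for the zero initialization of both running means.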
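import numpy as np

def adam(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Minimize a function with Adam, given its gradient `grad` and start point `w0`."""
    w = w0.copy()
    m = np.zeros_like(w)  # first moment: running mean of gradients (momentum)
    v = np.zeros_like(w)  # second moment: running mean of squared gradients (adaptive LR)
    history = [w.copy()]
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(w.copy())
    return w, history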
Optimizer Comparison
| Feature | SGD | Momentum | RMSProp | Adam |
|---------|:---:|:--------:|:-------:|:--------:|
| Overcome oscillation | No | Yes | No | Yes |
| Adaptive LR | No | No | Yes | Yes |
| Inertia | No | Yes | No | Yes |
| Bias correction | No | No | No | Yes |
| Convergence speed | Slow | Medium | Medium | Fastest |
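The speed ratings in the table are rules of thumb, and the actual ranking depends on the problem and on how each learning rate is tuned. One way to check them is to run all four update rules on the same loss and record the loss curve. The sketch below does that on a hypothetical narrow-valley quadratic. The helper `run`, the shared `lr=0.01`, `beta2=0.9` for RMSProp, and the step count are all assumptions made for illustration.

```python
import numpy as np

scales = np.array([1.0, 50.0])  # hypothetical narrow-valley quadratic
loss = lambda w: 0.5 * np.sum(scales * w**2)
grad = lambda w: scales * w

def run(update, w0, steps=300):
    """Apply `update(w, g, state, t) -> w` repeatedly and record the loss."""
    w, state = np.array(w0, dtype=float), {}
    losses = [loss(w)]
    for t in range(1, steps + 1):
        w = update(w, grad(w), state, t)
        losses.append(loss(w))
    return losses

def sgd(w, g, state, t, lr=0.01):
    return w - lr * g

def momentum(w, g, state, t, lr=0.01, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) + g
    return w - lr * state["v"]

def rmsprop(w, g, state, t, lr=0.01, beta2=0.9, eps=1e-8):
    state["s"] = beta2 * state.get("s", 0.0) + (1 - beta2) * g * g
    return w - lr * g / (np.sqrt(state["s"]) + eps)

def adam(w, g, state, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1**t)  # bias correction
    v_hat = state["v"] / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

results = {f.__name__: run(f, [1.0, 1.0]) for f in (sgd, momentum, rmsprop, adam)}
for name, losses in results.items():
    print(f"{name:9s} initial={losses[0]:.3f} final={losses[-1]:.3e}")
```

Plotting each list in `results` on a log scale gives the usual convergence comparison. Changing `scales` or the learning rates is a quick way to see how sensitive the ranking is.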
Selection Guide
| Scenario | Recommended | Reason |
|----------|:-----------:|--------|
| Simple convex | SGD / Momentum | No adaptive LR needed |
| Computer Vision | SGD + Momentum | Better generalization |
| NLP / Transformer | Adam / AdamW | Handles sparse gradients |
| GAN / RL | Adam | Stabilizes unstable training |
| Large models | AdamW | Adam + correct weight decay |
| Resource limited | RMSProp | No momentum storage needed |
Key Takeaways
- Momentum uses inertia to damp the oscillation of plain gradient descent
- RMSProp adapts the learning rate per parameter
- Adam = Momentum + RMSProp + bias correction, and it is the most broadly applicable default
- AdamW fixes how Adam applies weight decay, which helps when training large language models
- No optimizer is best everywhere: choose based on the characteristics of the problem
Next Chapter: SGD
In practice, Adam is run on mini-batches. The next chapter explores how batch size affects training.
Why Momentum?
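Standard gradient descent can oscillate in narrow valleys and stall in shallow local minima or on plateaus. Momentum mitigates both problems by adding a fraction of the previous update to the current one: components of the gradient that keep flipping sign cancel out, while components that point consistently in one direction accumulate.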
Momentum Formula
$$v_t = \beta v_{t-1} + \eta \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - v_t$$
Where:
- $v_t$ = velocity (accumulated gradient)
- $\beta$ = momentum coefficient (typically 0.9)
- $\eta$ = learning rate
- $\nabla L(\theta_t)$ = gradient at current parameters
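This formula folds the learning rate into the velocity, whereas the form given earlier in the chapter accumulates raw gradients and applies $\eta$ in the parameter update. With a constant learning rate and zero initial velocity the two produce the same iterates, since one velocity is just $\eta$ times the other. The sketch below checks this numerically. The function names and the quadratic test problem are assumptions made for illustration.

```python
import numpy as np

def momentum_lr_outside(grad, w0, lr=0.01, beta=0.9, steps=50):
    """v <- beta*v + g;  w <- w - lr*v"""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def momentum_lr_inside(grad, theta0, lr=0.01, beta=0.9, steps=50):
    """v <- beta*v + lr*g;  theta <- theta - v"""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + lr * grad(theta)
        theta = theta - v
    return theta

# Hypothetical quadratic with gradient scales 1 and 50
grad = lambda w: np.array([1.0, 50.0]) * w

a = momentum_lr_outside(grad, [1.0, 1.0])
b = momentum_lr_inside(grad, [1.0, 1.0])
print(np.allclose(a, b))
```

The two forms stop being interchangeable once the learning rate changes during training, because the second form bakes past learning rates into the stored velocity.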
Momentum vs Standard GD
| Aspect | Standard GD | Momentum GD |
|--------|-------------|-------------|
| Update | Directly follows gradient | Accumulates gradient history |
| Oscillation | High in narrow valleys | Dampened by momentum |
| Local minima | Can stall in shallow minima | May roll through shallow minima |
| Convergence | Slow in narrow valleys | Often faster; the gain depends on the problem |
| $\beta$ | N/A | 0.9 (typical) |
Adam Optimizer
Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates.
Adam Formula
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
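$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
Here $g_t = \nabla L(\theta_t)$ is the gradient at step $t$, $g_t^2$ is squared element-wise, and $m_0 = v_0 = 0$.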
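To see what the bias correction does, it helps to work through the very first step. Starting from zero moments, $m_1 = (1-\beta_1) g_1$ and $v_1 = (1-\beta_2) g_1^2$, so the corrected values are $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, and the step is approximately $\eta \cdot \mathrm{sign}(g_1)$. Without the correction the step would be scaled by $(1-\beta_1)/\sqrt{1-\beta_2}$, about 3.16 for the typical values. The sketch below checks this arithmetic. The gradient values are made up for illustration.

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = np.array([0.5, -20.0])  # hypothetical first gradient

m = (1 - beta1) * g     # m_1, starting from m_0 = 0
v = (1 - beta2) * g**2  # v_1, starting from v_0 = 0

m_hat = m / (1 - beta1**1)  # equals g
v_hat = v / (1 - beta2**1)  # equals g**2

step_corrected = lr * m_hat / (np.sqrt(v_hat) + eps)
step_raw = lr * m / (np.sqrt(v) + eps)

print("with bias correction:   ", step_corrected)
print("without bias correction:", step_raw)
```

Both coordinates take a first step of size close to `lr` regardless of the gradient magnitude, which is why the learning rate in Adam acts roughly as a per-step bound on how far each parameter moves early in training.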
Adam Parameters
| Parameter | Typical Value | Effect |
|-----------|---------------|--------|
| $\eta$ | 0.001 | Learning rate |
| $\beta_1$ | 0.9 | Momentum decay |
| $\beta_2$ | 0.999 | Adaptive rate decay |
| $\epsilon$ | $10^{-8}$ | Numerical stability |
Python Implementation
import numpy as np

class Adam:
    """Adam optimizer for a dict of named NumPy parameter arrays."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = {}  # First moment (running mean of gradients), per parameter
        self.v = {}  # Second moment (running mean of squared gradients), per parameter
        self.t = 0   # Time step

    def update(self, params, grads):
        """Apply one Adam step in place and return the updated params dict."""
        self.t += 1
        for key in params.keys():
            if key not in self.m:
                self.m[key] = np.zeros_like(grads[key])
                self.v[key] = np.zeros_like(grads[key])
            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params

# Usage example (gradients are held constant here purely for illustration)
adam = Adam(lr=0.001)
params = {'w': np.array([1.0, 2.0]), 'b': np.array([0.0])}
grads = {'w': np.array([0.1, 0.2]), 'b': np.array([0.05])}
for epoch in range(100):
    params = adam.update(params, grads)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: w={params['w']}, b={params['b']}")
Optimizer Comparison
| Optimizer | Adaptive LR | Momentum | Best For |
|-----------|-------------|----------|----------|
| SGD | ❌ No | ❌ No | Simple problems |
| SGD + Momentum | ❌ No | ✅ Yes | Computer vision |
| AdaGrad | ✅ Yes | ❌ No | Sparse data |
| RMSProp | ✅ Yes | ❌ No | RNN, time series |
| Adam | ✅ Yes | ✅ Yes | Default choice |
| AdamW | ✅ Yes | ✅ Yes | Adam + weight decay |
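The last row deserves a short illustration. In AdamW the weight decay term is applied directly to the parameters instead of being added to the gradient, so it is not rescaled by the adaptive denominator. The sketch below shows one such step under that reading. The function name `adamw_step` and the default `weight_decay=0.01` are assumptions for this example, not values taken from this chapter.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One Adam step with decoupled weight decay (a sketch, not a library API)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # The decay term is added outside the adaptive scaling.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# With a zero gradient, only the decay acts: w shrinks by (1 - lr * weight_decay).
w0 = np.array([1.0, -2.0])
zeros = np.zeros_like(w0)
w1, m1, v1 = adamw_step(w0, zeros, zeros, zeros, t=1)
print(w1)
```

Setting `weight_decay=0` recovers the plain Adam update used elsewhere in this chapter.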
Summary
Adam is one of the most widely used optimizers in deep learning. It combines momentum (smooth updates) with adaptive learning rates (per-parameter scaling).
Key takeaways:
- Momentum: accumulates gradient history to smooth updates and help escape shallow local minima
- Adam = Momentum + RMSProp: best of both worlds
- Adam default params: lr=0.001, beta1=0.9, beta2=0.999
- Bias correction: compensates for zero initialization in early steps
- Adaptive LR: each parameter has its own learning rate
- Momentum overcomes oscillation in narrow valleys
- Adam is the default optimizer for most deep learning tasks
Next Chapter: Mini-Batch SGD
The next chapter explores stochastic gradient descent and batch strategies.