title: "Momentum & Adam Optimizer"
description: "Implement Momentum, RMSProp, and Adam, compare convergence speeds."
order: 2

Momentum & Adam

Momentum

Plain gradient descent (GD) tends to oscillate across narrow valleys of the loss surface. Momentum adds inertia by accumulating past gradients in a velocity term $v$:

$$v_{t+1} = \beta v_t + \nabla L(w_t)$$

$$w_{t+1} = w_t - \eta v_{t+1}$$

$\beta$ (usually 0.9) controls how much of the previous velocity is kept. A higher $\beta$ gives a smoother path but can overshoot the minimum.
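As a rough illustration of the update above, here is a minimal NumPy sketch that runs plain GD and momentum on a hypothetical ill-conditioned quadratic. The function names (`gd`, `momentum_gd`), the test problem, and the step count are assumptions made for this example, not part of the chapter's reference code.

```python
import numpy as np

def gd(grad, w0, lr=0.01, steps=300):
    """Plain gradient descent: w <- w - lr * grad(w)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum_gd(grad, w0, lr=0.01, beta=0.9, steps=300):
    """Momentum: v <- beta * v + grad(w); w <- w - lr * v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)  # velocity starts at zero
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

# Hypothetical test problem: L(w) = 0.5 * (w[0]**2 + 100 * w[1]**2),
# a valley that is 100x steeper along w[1] than along w[0].
scales = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * float(np.sum(scales * w ** 2))

def grad(w):
    return scales * w

w_start = np.array([1.0, 1.0])
w_gd = gd(grad, w_start)
w_mom = momentum_gd(grad, w_start)
print("GD final loss:      ", loss(w_gd))
print("Momentum final loss:", loss(w_mom))
```

With the same $\eta$, the momentum run takes effectively larger steps once the velocity builds up (roughly $\eta/(1-\beta)$ for a steady gradient), so treat this as an illustration rather than a tuned benchmark.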

RMSProp

Different parameters have different gradient scales. RMSProp gives each parameter its own effective learning rate by dividing the step by a running average of squared gradients:

$$s_{t+1} = \beta_2 s_t + (1-\beta_2)(\nabla L(w_t))^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla L(w_t)$$

Large recent gradients: the effective learning rate decreases automatically
Small recent gradients: the effective learning rate increases automatically
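The two RMSProp equations can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: the function name `rmsprop`, the decay value 0.9, and the quadratic test problem are choices made for this example only.

```python
import numpy as np

def rmsprop(grad, w0, lr=0.01, beta2=0.9, eps=1e-8, steps=300):
    """RMSProp: scale each coordinate's step by a running RMS of its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    s = np.zeros_like(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        s = beta2 * s + (1 - beta2) * g * g      # element-wise square
        w = w - lr * g / (np.sqrt(s) + eps)      # per-parameter step size
    return w

# Hypothetical test problem: gradients differ by a factor of 100 per coordinate.
scales = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * float(np.sum(scales * w ** 2))

def grad(w):
    return scales * w

w_start = np.array([1.0, 1.0])
w_final = rmsprop(grad, w_start)
print("start loss:", loss(w_start))
print("final loss:", loss(w_final))
```

Because the step is divided by the gradient's own running magnitude, both coordinates move at a similar pace even though their raw gradients differ by two orders of magnitude. With a constant $\eta$ the iterate ends up hovering near the minimum rather than settling exactly on it.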

Adam = Momentum + RMSProp

Adam is the most commonly used optimizer, combining momentum with adaptive learning rates.

import numpy as np

def adam(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    w = w0.copy()
    m = np.zeros_like(w)  # first moment (momentum term)
    v = np.zeros_like(w)  # second moment (scales the per-parameter step)
    history = [w.copy()]
    for t in range(1, steps+1):
        g = grad(w)
        m = beta1*m + (1-beta1)*g
        v = beta2*v + (1-beta2)*g*g
        m_hat = m / (1-beta1**t)   # bias correction
        v_hat = v / (1-beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(w.copy())
    return w, history

Optimizer Comparison

| Feature | SGD | Momentum | RMSProp | Adam |
|---------|:---:|:--------:|:-------:|:--------:|
| Overcome oscillation | No | Yes | No | Yes |
| Adaptive LR | No | No | Yes | Yes |
| Inertia | No | Yes | No | Yes |
| Bias correction | No | No | No | Yes |
| Typical convergence speed | Slow | Medium | Medium | Often fastest |
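The speed ranking in the table depends on the problem and on tuning, so it is worth checking on a concrete case. The harness below is a sketch under stated assumptions: the `run` function, the shared learning rate, the RMSProp decay of 0.9, and the quadratic test problem are all hypothetical choices for this example, and "sgd" here means full-batch gradient descent.

```python
import numpy as np

def run(name, grad, w0, lr=0.01, steps=300,
        beta1=0.9, beta2=0.999, rms_decay=0.9, eps=1e-8):
    """Run one update rule for `steps` iterations and return the final parameters."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # velocity / first-moment state
    s = np.zeros_like(w)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        if name == "sgd":
            w = w - lr * g
        elif name == "momentum":
            m = beta1 * m + g
            w = w - lr * m
        elif name == "rmsprop":
            s = rms_decay * s + (1 - rms_decay) * g * g
            w = w - lr * g / (np.sqrt(s) + eps)
        elif name == "adam":
            m = beta1 * m + (1 - beta1) * g
            s = beta2 * s + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)  # bias correction
            s_hat = s / (1 - beta2 ** t)
            w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {name}")
    return w

# Hypothetical test problem: L(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
scales = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * float(np.sum(scales * w ** 2))

def grad(w):
    return scales * w

w_start = np.array([1.0, 1.0])
names = ("sgd", "momentum", "rmsprop", "adam")
final_losses = {name: loss(run(name, grad, w_start)) for name in names}
for name, value in final_losses.items():
    print(f"{name:>8}: final loss = {value:.3e}")
```

Using one learning rate for all four rules is not a fair comparison, since each optimizer has its own useful range. Sweeping `lr` per optimizer before comparing gives a more honest picture.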

Selection Guide

| Scenario | Recommended | Reason |
|----------|:-----------:|--------|
| Simple convex | SGD / Momentum | No adaptive LR needed |
| Computer Vision | SGD + Momentum | Often generalizes better |
| NLP / Transformer | Adam / AdamW | Handles sparse gradients |
| GAN / RL | Adam | Helps stabilize unstable training |
| Large models | AdamW | Adam with decoupled weight decay |
| Resource limited | RMSProp | No momentum buffer to store |
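The AdamW row refers to decoupled weight decay: the decay term is applied directly to the weights instead of being added to the gradient, so it is not rescaled by the adaptive denominator. A minimal single-step sketch follows. The function name `adamw_step` and the default `weight_decay=0.01` are assumptions for this example.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update; returns the new (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: weight_decay * w sits outside the adaptive scaling.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# With a zero gradient, only the decay term acts on the weights.
w0 = np.array([1.0, -2.0])
zeros = np.zeros_like(w0)
w1, m1, v1 = adamw_step(w0, zeros, zeros, zeros, t=1)
print("weights after one decay-only step:", w1)
```

With a zero gradient each weight shrinks by the factor `1 - lr * weight_decay`, independent of the second-moment estimate. That independence is the point of decoupling.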

Key Takeaways

Momentum uses inertia to damp GD's oscillation in narrow valleys
RMSProp adapts the learning rate per parameter
Adam = Momentum + RMSProp + bias correction; the most widely applicable default
AdamW improves on Adam for large language model training
No optimizer is best everywhere: choose based on the problem's characteristics

Next Chapter: SGD

In practice Adam is run on mini-batches. The next chapter explores how batch size affects training.

Why Momentum?

Standard gradient descent can oscillate in narrow valleys and stall in shallow local minima or flat regions. Momentum mitigates this by adding a fraction of the previous update to the current update.

Momentum Formula

$$v_t = \beta v_{t-1} + \eta \nabla L(\theta_t)$$

$$\theta_{t+1} = \theta_t - v_t$$

Where:

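$v_t$ = velocity (accumulated gradient; in this form the learning rate is folded into the velocity, which is equivalent to the earlier form when $\eta$ is constant)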
$\beta$ = momentum coefficient (typically 0.9)
$\eta$ = learning rate
$\nabla L(\theta_t)$ = gradient at current parameters
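This chapter writes momentum in two ways: with $\eta$ applied to the velocity at update time, and with $\eta$ folded into the velocity as in the formula above. The sketch below checks numerically that both give the same iterates when $\eta$ is constant and the velocity starts at zero. The function names and the quadratic test gradient are assumptions for this example.

```python
import numpy as np

def momentum_lr_outside(grad, w0, lr=0.01, beta=0.9, steps=50):
    """v <- beta * v + grad(w); w <- w - lr * v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def momentum_lr_inside(grad, theta0, lr=0.01, beta=0.9, steps=50):
    """v <- beta * v + lr * grad(theta); theta <- theta - v."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + lr * grad(theta)
        theta = theta - v
    return theta

# Hypothetical quadratic gradient used only to drive both loops.
scales = np.array([1.0, 100.0])

def grad(w):
    return scales * w

start = np.array([1.0, 1.0])
w_outside = momentum_lr_outside(grad, start)
theta_inside = momentum_lr_inside(grad, start)
print("lr outside velocity:", w_outside)
print("lr inside velocity: ", theta_inside)
```

The two forms stop being interchangeable if $\eta$ changes during training, because the folded-in form bakes each step's learning rate into the stored velocity.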

Momentum vs Standard GD

| Aspect | Standard GD | Momentum GD |
|--------|-------------|-------------|
| Update | Directly follows gradient | Accumulates gradient history |
| Oscillation | High in narrow valleys | Dampened by momentum |
| Local minima | Easily trapped | Can roll through shallow ones |
| Convergence | Slow | Often faster, especially on ill-conditioned problems |
| $\beta$ | N/A | 0.9 (typical) |

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates.

Adam Formula

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $g_t = \nabla L(\theta_t)$ and $g_t^2$ is the element-wise square.

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
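To see what the bias correction does, it helps to work through the very first step ($t = 1$) with both moments initialized to zero. The gradient values below are arbitrary numbers chosen for this sketch.

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = np.array([0.001, -10.0])  # arbitrary gradients on very different scales

# First step from zero-initialized moments (t = 1).
m = (1 - beta1) * g           # only 10% of g: biased toward zero
v = (1 - beta2) * g ** 2      # only 0.1% of g**2: biased toward zero

# Bias correction divides out the missing mass.
m_hat = m / (1 - beta1 ** 1)  # recovers g
v_hat = v / (1 - beta2 ** 1)  # recovers g**2

raw_step = lr * m / (np.sqrt(v) + eps)
corrected_step = lr * m_hat / (np.sqrt(v_hat) + eps)

print("m_hat:               ", m_hat)
print("step without correction:", raw_step)
print("step with correction:   ", corrected_step)
```

After correction the first step has magnitude close to `lr` in every coordinate, whatever the gradient scale. Without it, the two zero-biased moments do not cancel and the first step comes out larger than `lr`.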

Adam Parameters

| Parameter | Typical Value | Effect |
|-----------|---------------|--------|
| $\eta$ | 0.001 | Learning rate |
| $\beta_1$ | 0.9 | Momentum decay |
| $\beta_2$ | 0.999 | Adaptive rate decay |
| $\epsilon$ | $10^{-8}$ | Numerical stability |

Python Implementation

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step
    
    def update(self, params, grads):
        self.t += 1
        
        for key in params.keys():
            if key not in self.m:
                self.m[key] = np.zeros_like(grads[key])
                self.v[key] = np.zeros_like(grads[key])
            
            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        
        return params

# Usage example: gradients are held constant here purely to exercise the update rule
adam = Adam(lr=0.001)
params = {'w': np.array([1.0, 2.0]), 'b': np.array([0.0])}
grads = {'w': np.array([0.1, 0.2]), 'b': np.array([0.05])}

for epoch in range(100):
    params = adam.update(params, grads)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: w={params['w']}, b={params['b']}")

Optimizer Comparison

| Optimizer | Adaptive LR | Momentum | Best For |
|-----------|-------------|----------|----------|
| SGD | ❌ No | ❌ No | Simple problems |
| SGD + Momentum | ❌ No | ✅ Yes | Computer vision |
| AdaGrad | ✅ Yes | ❌ No | Sparse data |
| RMSProp | ✅ Yes | ❌ No | RNN, time series |
| Adam | ✅ Yes | ✅ Yes | Default choice |
| AdamW | ✅ Yes | ✅ Yes | Adam + weight decay |

Summary

Adam is the most widely used optimizer in deep learning. It combines momentum (smooth updates) with adaptive learning rates (per-parameter scaling).

Key takeaways:

Momentum: accumulates gradient history to smooth updates and escape local minima
Adam = Momentum + RMSProp, the best of both worlds
Adam default params: lr=0.001, beta1=0.9, beta2=0.999
Bias correction: compensates for zero initialization in early steps
Adaptive LR: each parameter has its own learning rate
Momentum overcomes oscillation in narrow valleys
Adam is the default optimizer for most deep learning tasks

Next Chapter: Mini-Batch SGD

The next chapter explores stochastic gradient descent and batch strategies.
