title: "Momentum & Adam Optimizer"
description: "Implement Momentum, RMSProp, and Adam, compare convergence speeds."
order: 2

Momentum & Adam

Momentum

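Plain gradient descent (GD) oscillates in narrow valleys: the gradient points mostly across the valley rather than along it, so successive steps zigzag. Momentum adds inertia by accumulating past gradients in a velocity $v$:

$$v_{t+1} = \beta v_t + \nabla L(w_t)$$

$$w_{t+1} = w_t - \eta v_{t+1}$$

$\beta$ (usually 0.9) controls how much of the previous velocity is kept. A higher $\beta$ gives a smoother path but makes overshooting the minimum more likely.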
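The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `momentum_gd`, the quadratic "narrow valley" test problem, and the values of `lr` and `steps` are all assumptions chosen for the example.

```python
import numpy as np

def momentum_gd(grad, w0, lr=0.01, beta=0.9, steps=100):
    """Gradient descent with momentum: v <- beta*v + g, then w <- w - lr*v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)          # velocity starts at rest
    history = [w.copy()]
    for _ in range(steps):
        g = grad(w)
        v = beta * v + g          # accumulate gradient history
        w = w - lr * v            # step along the velocity
        history.append(w.copy())
    return w, history

# Hypothetical test problem: L(w) = 0.5 * (w[0]**2 + 50 * w[1]**2),
# a valley that is 50x steeper along w[1] than along w[0].
scales = np.array([1.0, 50.0])
loss = lambda w: 0.5 * np.sum(scales * w**2)
grad = lambda w: scales * w

w0 = np.array([5.0, 1.0])
w_plain, _ = momentum_gd(grad, w0, beta=0.0)   # beta=0 reduces to plain GD
w_mom, _ = momentum_gd(grad, w0, beta=0.9)
print(f"plain GD loss after 100 steps: {loss(w_plain):.6f}")
print(f"momentum loss after 100 steps: {loss(w_mom):.6f}")
```

On this problem the momentum run should end at a much lower loss for the same `lr`. Part of that gain comes from the larger effective step size: with a constant gradient, the velocity in this formulation settles at roughly $1/(1-\beta)$ times the gradient.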

RMSProp

Different parameters have different gradient scales. RMSProp maintains per-parameter learning rates:

$$s_{t+1} = \beta_2 s_t + (1-\beta_2)(\nabla L(w_t))^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla L(w_t)$$

  • Large gradients: learning rate auto-decreases
  • Small gradients: learning rate auto-increases
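A small NumPy sketch of the RMSProp update makes the per-parameter scaling visible. The function name `rmsprop`, the default `beta2=0.9`, and the two-parameter test gradient are assumptions made for illustration; only the update equations come from the text above.

```python
import numpy as np

def rmsprop(grad, w0, lr=0.01, beta2=0.9, eps=1e-8, steps=100):
    """RMSProp: divide each gradient component by a running RMS of its history."""
    w = np.asarray(w0, dtype=float).copy()
    s = np.zeros_like(w)                      # running average of squared gradients
    history = [w.copy()]
    for _ in range(steps):
        g = grad(w)
        s = beta2 * s + (1 - beta2) * g * g   # elementwise square
        w = w - lr * g / (np.sqrt(s) + eps)   # per-parameter step size
        history.append(w.copy())
    return w, history

# Hypothetical gradients whose scales differ by a factor of 10,000.
scales = np.array([100.0, 0.01])
grad = lambda w: scales * w

w0 = np.array([1.0, 1.0])
_, hist = rmsprop(grad, w0, steps=1)
first_step = hist[0] - hist[1]
print("raw gradient:     ", grad(w0))
print("actual first step:", first_step)
```

Although the raw gradients differ by four orders of magnitude, both parameters should move by nearly the same amount, because each gradient is divided by its own running magnitude.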

Adam = Momentum + RMSProp

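Adam is one of the most commonly used optimizers in deep learning. It combines momentum (a running average of gradients) with RMSProp-style adaptive learning rates (a running average of squared gradients), and adds bias correction for the zero-initialized averages.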

import numpy as np

def adam(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    w = w0.copy()
    m = np.zeros_like(w)  # momentum
    v = np.zeros_like(w)  # adaptive LR
    history = [w.copy()]
    for t in range(1, steps+1):
        g = grad(w)
        m = beta1*m + (1-beta1)*g
        v = beta2*v + (1-beta2)*g*g
        m_hat = m / (1-beta1**t)   # bias correction
        v_hat = v / (1-beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(w.copy())
    return w, history

Optimizer Comparison

| Feature | SGD | Momentum | RMSProp | Adam |
|---------|:---:|:--------:|:-------:|:----:|
| Overcomes oscillation | No | Yes | No | Yes |
| Adaptive LR | No | No | Yes | Yes |
| Inertia | No | Yes | No | Yes |
| Bias correction | No | No | No | Yes |
| Convergence speed (typical) | Slow | Medium | Medium | Often fastest |

Selection Guide

| Scenario | Recommended | Reason |
|----------|:-----------:|--------|
| Simple convex | SGD / Momentum | No adaptive LR needed |
| Computer Vision | SGD + Momentum | Often generalizes better |
| NLP / Transformer | Adam / AdamW | Handles sparse gradients |
| GAN / RL | Adam | Helps stabilize unstable training |
| Large models | AdamW | Adam with decoupled weight decay |
| Resource limited | RMSProp | One state buffer per parameter instead of Adam's two |
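The "Adam with decoupled weight decay" entry can be made concrete with a one-step comparison. The sketch below is an illustration under stated assumptions: the helper names (`_adam_direction`, `adam_l2_step`, `adamw_step`), the decay strength `wd`, and the zero-gradient test case are hypothetical, and framework implementations of AdamW may differ in detail.

```python
import numpy as np

def _adam_direction(g, m, v, t, beta1, beta2, eps):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: decay is folded into the gradient."""
    direction, m, v = _adam_direction(g + wd * w, m, v, t, beta1, beta2, eps)
    return w - lr * direction, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, wd=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled decay: applied to the weights outside the adaptive scaling."""
    direction, m, v = _adam_direction(g, m, v, t, beta1, beta2, eps)
    return w - lr * (direction + wd * w), m, v

# Zero loss gradient, so any movement comes from weight decay alone.
w = np.array([1.0, 10.0])
g = np.zeros_like(w)
zeros = np.zeros_like(w)

w_l2, _, _ = adam_l2_step(w, g, zeros, zeros, t=1)
w_aw, _, _ = adamw_step(w, g, zeros, zeros, t=1)
print("shrink with L2 in gradient:", w - w_l2)
print("shrink with decoupled decay:", w - w_aw)
```

With the decay folded into the gradient, the adaptive scaling normalizes it away, so both weights shrink by about `lr` regardless of their size. With decoupled decay, each weight shrinks in proportion to its own magnitude, which is the behavior weight decay is meant to have.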


Key Takeaways

  • Momentum uses inertia to damp the oscillation of plain gradient descent
  • RMSProp adapts the learning rate per parameter
  • Adam = Momentum + RMSProp + bias correction, a strong general-purpose default
  • AdamW decouples weight decay from Adam's adaptive update, which helps when training large language models
  • No optimizer is universally best; choose based on problem characteristics

Next Chapter: SGD

In practice, Adam is applied to gradients estimated from mini-batches. The next chapter explores how batch size affects training.

Why Momentum?

Standard gradient descent can oscillate in narrow valleys and stall in shallow local minima or flat regions. Momentum mitigates both problems by adding a fraction of the previous update to the current update.

Momentum Formula

$$v_t = \beta v_{t-1} + \eta \nabla L(\theta_t)$$

$$\theta_{t+1} = \theta_t - v_t$$

Where:

  • $v_t$ = velocity (accumulated gradient)
  • $\beta$ = momentum coefficient (typically 0.9)
  • $\eta$ = learning rate
  • $\nabla L(\theta_t)$ = gradient at current parameters
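This form places $\eta$ inside the velocity, whereas the form at the start of the chapter ($v_{t+1} = \beta v_t + \nabla L(w_t)$, $w_{t+1} = w_t - \eta v_{t+1}$) applies it outside. With a constant learning rate and zero initial velocity the two are equivalent, since one velocity is just $\eta$ times the other. The following sketch checks this numerically; the function names and the quadratic test gradient are hypothetical.

```python
import numpy as np

def momentum_lr_outside(grad, w0, lr, beta, steps):
    """v <- beta*v + g ; w <- w - lr*v"""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
    return w

def momentum_lr_inside(grad, theta0, lr, beta, steps):
    """v <- beta*v + lr*g ; theta <- theta - v"""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + lr * grad(theta)
        theta = theta - v
    return theta

# Hypothetical quadratic with very different curvature per axis.
grad = lambda w: np.array([1.0, 50.0]) * w
start = np.array([5.0, 1.0])

a = momentum_lr_outside(grad, start, lr=0.01, beta=0.9, steps=50)
b = momentum_lr_inside(grad, start, lr=0.01, beta=0.9, steps=50)
print("lr outside velocity:", a)
print("lr inside velocity: ", b)
print("same trajectory:", np.allclose(a, b))
```

The two results should agree up to floating-point rounding. They stop being equivalent once the learning rate changes during training, because the second form bakes past learning rates into the stored velocity.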

Momentum vs Standard GD

| Aspect | Standard GD | Momentum GD |
|--------|-------------|-------------|
| Update | Directly follows gradient | Accumulates gradient history |
| Oscillation | High in narrow valleys | Dampened by momentum |
| Local minima | Easily trapped | Can roll through shallow ones |
| Convergence | Slow | Often faster, especially on ill-conditioned problems |
| $\beta$ | N/A | 0.9 (typical) |

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum with per-parameter adaptive learning rates.

Adam Formula

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Here $g_t = \nabla L(\theta_t)$ is the gradient at step $t$, $g_t^2$ is its elementwise square, and $m_0 = v_0 = 0$.
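To see why the bias-correction terms matter, it helps to run the first few steps by hand. The sketch below holds a hypothetical gradient constant (an assumption made purely so the expected averages are obvious) and prints the raw and corrected first moment alongside the resulting step direction.

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
g = np.array([0.5, -20.0])     # hypothetical gradient, held constant

m = np.zeros_like(g)           # first moment starts at zero
v = np.zeros_like(g)           # second moment starts at zero
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)             # undo the pull toward zero
    v_hat = v / (1 - beta2**t)
    step_over_lr = m_hat / (np.sqrt(v_hat) + eps)
    print(f"t={t}  raw m={m}  corrected m_hat={m_hat}  step/lr={step_over_lr}")
```

Without correction, the raw average `m` after the first step is only a fraction $(1 - \beta_1)$ of the true gradient because it started at zero. Dividing by $1 - \beta_1^t$ recovers the gradient exactly in this constant-gradient case, and the resulting step has magnitude close to the learning rate for every parameter regardless of gradient scale.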

Adam Parameters

| Parameter | Typical Value | Effect |
|-----------|---------------|--------|
| $\eta$ | 0.001 | Learning rate |
| $\beta_1$ | 0.9 | Momentum decay |
| $\beta_2$ | 0.999 | Adaptive rate decay |
| $\epsilon$ | $10^{-8}$ | Numerical stability |

Python Implementation

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step
    
    def update(self, params, grads):
        self.t += 1
        
        for key in params.keys():
            if key not in self.m:
                self.m[key] = np.zeros_like(grads[key])
                self.v[key] = np.zeros_like(grads[key])
            
            # Update biased moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        
        return params

# Usage example. The gradients are held constant here only to exercise the API;
# in real training they are recomputed from the loss at every step.
optimizer = Adam(lr=0.001)
params = {'w': np.array([1.0, 2.0]), 'b': np.array([0.0])}
grads = {'w': np.array([0.1, 0.2]), 'b': np.array([0.05])}

for step in range(100):
    params = optimizer.update(params, grads)
    if step % 20 == 0:
        print(f"Step {step}: w={params['w']}, b={params['b']}")

Optimizer Comparison

| Optimizer | Adaptive LR | Momentum | Best For |
|-----------|-------------|----------|----------|
| SGD | ❌ No | ❌ No | Simple problems |
| SGD + Momentum | ❌ No | ✅ Yes | Computer vision |
| AdaGrad | ✅ Yes | ❌ No | Sparse data |
| RMSProp | ✅ Yes | ❌ No | RNN, time series |
| Adam | ✅ Yes | ✅ Yes | Default choice |
| AdamW | ✅ Yes | ✅ Yes | Models that need weight decay (Adam + decoupled weight decay) |

Summary

Adam is one of the most widely used optimizers in deep learning. It combines momentum (smooth updates) with adaptive learning rates (per-parameter scaling).

Key takeaways:

  • Momentum accumulates gradient history to smooth updates and can help escape shallow local minima
  • Momentum damps oscillation in narrow valleys
  • Adam = Momentum + RMSProp, combining the strengths of both
  • Adam default params: lr=0.001, beta1=0.9, beta2=0.999
  • Bias correction compensates for zero initialization in early steps
  • Adaptive LR: each parameter has its own effective learning rate
  • Adam is a common default optimizer for deep learning tasks

Next Chapter: Mini-Batch SGD

The next chapter explores stochastic gradient descent and batch strategies.
