Reservoir Sampling

🔥 Vibe Prompt

"Sample 100 random users from 10M log entries in O(n) with O(k) memory."

import random

def reservoir_sample(stream, k):
    """Sample k items from an iterable stream"""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Simulate 10M stream
def log_stream():
    for i in range(10000000):
        yield f"user_{i}_session_{i % 1000}"

import time
start = time.time()
sample = reservoir_sample(log_stream(), 100)
elapsed = time.time() - start

print(f"Sampled {len(sample)} users from 10M")
print(f"Time: {elapsed:.2f}s")
print(f"Memory: O(100) items")
print(f"Sample: {sample[:5]}...")

Applications

Real-time web analytics sampling
A/B test assignment
Network packet sampling
ML training data selection

Proof

Each item has probability k/n of being selected. Reservoir ensures uniform sampling without knowing n in advance.

Key Points

Understand the core concepts thoroughly
Practice with hands-on code examples
Apply knowledge to real-world problems
Review and reinforce through exercises

Further Learning

Official documentation
Open source projects on GitHub
Community forums and discussions
Related courses and tutorials

水庫抽樣：你不知道資料多大時怎麼辦？

想像你是一個串流平台（如 Netflix）的工程師，每秒鐘有上千個使用者登入。你想從所有線上使用者中隨機選 10 個人做問卷調查——但你不知道總共有多少人在線上。

這就是水庫抽樣（Reservoir Sampling）要解決的問題。

演算法的美

步驟：
1. 先把前 k 個元素放進水庫（陣列）
2. 遇到第 i 個元素時，以 k/i 的機率決定是否取代水庫中的隨機一個
3. 走完整個串流後，水庫中的 k 個元素就是均勻隨機樣本

這演算法神奇的地方：不管串流有 100 筆還是 100 億筆資料，每個元素被選中的機率都是 k/n。

在真實世界的應用

| 場景 | 為什麼用 Reservoir Sampling | |:----|:--------------------------| | 即時資料串流 | 資料量無限，無法全部存下來 | | 抽樣稽核 | 從每天數百萬筆交易中抽樣檢查是否有異常 | | A/B 測試 | 從即時流量中隨機分配使用者到實驗組/對照組 | | 負載監控 | 從大量請求中取樣分析延遲分布 |

下一章預告

水庫抽樣是隨機化演算法中實用性極高的一個工具。課程最後一章我們將討論機率分析與集中不等式——這些數學工具能幫你證明為什麼這些隨機化演算法真的有效。