Count-Min Sketch: Streaming Frequency Estimation

Imagine you need to count how many times each hashtag appears on Twitter in real-time. Millions of tweets per minute, thousands of unique hashtags. Using a HashMap would require significant memory and grow without bound. Count-Min Sketch does it with approximately 5 KB.

Why Frequency Estimation?

Many real-time analytics questions involve counting frequencies:

What are the trending hashtags right now?
Which products are being viewed most frequently?
Which IP addresses are making the most requests?
What are the most common error types in logs?

Exact frequency counting requires storing a counter for every unique item. For high-cardinality streams, this is impractical.

What Is Count-Min Sketch?

Count-Min Sketch (CMS) is a probabilistic data structure that estimates the frequency of elements in a stream. It uses a 2D array of counters with d rows (depth) and w columns (width), where each row uses an independent hash function.

Key properties:

Always overestimates: CMS never underestimates a frequency (due to hash collisions)
Bounded error: With probability 1-delta, error <= epsilon x N
Fixed memory: Memory determined by epsilon and delta, not data size
Mergeable: Multiple CMS can be combined

How Count-Min Sketch Works

Data Structure

      width = e/epsilon
     ┌────────────────────┐
row 0│ 0 0 0 0 ... 0 0 0 │ ← hash function h1
row 1│ 0 0 0 0 ... 0 0 0 │ ← hash function h2
row 2│ 0 0 0 0 ... 0 0 0 │ ← hash function h3
  ⋮  │        ...         │
row d│ 0 0 0 0 ... 0 0 0 │ ← hash function hd
     └────────────────────┘

Adding an Element

1. For each row r (0 to d-1):
   a. Compute hash = hr(item)
   b. column = hash % width
   c. Increment table[r][column] by 1

Estimating Frequency

1. For each row r:
   a. Compute hash = hr(item)
   b. column = hash % width
   c. Get estimate = table[r][column]
2. Return minimum across all rows

Why take the minimum? CMS always overestimates because hash collisions cause multiple elements to share the same counter. Taking the minimum across rows gives the estimate with the fewest collisions.

Implementation

import math
import mmh3

class CountMinSketch:
    """Count-Min Sketch for streaming frequency estimation"""
    
    def __init__(self, epsilon: float = 0.01, delta: float = 0.01):
        """
        epsilon: Error bound (as fraction of total count)
        delta: Probability of exceeding error bound
        """
        self.width = int(math.ceil(math.e / epsilon))
        self.depth = int(math.ceil(math.log(1 / delta)))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0
    
    def add(self, item: str, count: int = 1) -> None:
        """Add element occurrence(s) to the sketch"""
        for d in range(self.depth):
            idx = mmh3.hash(str(item), d) % self.width
            self.table[d][idx] += count
        self.total += count
    
    def estimate(self, item: str) -> int:
        """Estimate frequency of an element"""
        estimates = []
        for d in range(self.depth):
            idx = mmh3.hash(str(item), d) % self.width
            estimates.append(self.table[d][idx])
        return min(estimates)
    
    def merge(self, other: "CountMinSketch") -> None:
        """Merge another CMS into this one (distributed counting)"""
        for d in range(self.depth):
            for w in range(self.width):
                self.table[d][w] += other.table[d][w]
        self.total += other.total

# Test with simulated stream
test_cms = CountMinSketch(epsilon=0.01, delta=0.01)
print(f"Table: {test_cms.width}x{test_cms.depth} = {test_cms.width * test_cms.depth} counters")
print(f"Memory: {test_cms.width * test_cms.depth * 4 / 1024:.2f} KB (32-bit ints)")

import random
random.seed(42)

true_counts = {}
for _ in range(100_000):
    item = f"item_{random.randint(1, 1000)}"
    test_cms.add(item)
    true_counts[item] = true_counts.get(item, 0) + 1

# Evaluate accuracy
print(f"\n=== Frequency Estimation Accuracy ===")
errors = []
for item, true_count in list(true_counts.items())[:20]:
    est = test_cms.estimate(item)
    error = est - true_count
    errors.append(error)
    print(f"{item}: true={true_count}, est={est}, error={error}")

print(f"\nAverage error: {sum(errors)/len(errors):.2f}")
print(f"Max error: {max(errors)}")
print(f"Theory guarantees: error <= epsilon * N = {0.01 * 100000} = 1000")

Mathematical Guarantee

With probablity 1 - delta:

Actual frequency <= Estimated frequency <= Actual frequency + epsilon x N

Where:
  epsilon = error parameter (e.g., 0.01)
  delta = confidence parameter (e.g., 0.01)
  N = total number of items processed

Example: epsilon=0.01, delta=0.01, N=100,000
  Width = e/0.01 = 272
  Depth = ln(1/0.01) = 5
  Memory = 272 x 5 = 1,360 counters = 5.3 KB (32-bit ints)
  Error <= 0.01 x 100,000 = 1000

Memory Comparison

| Data Structure | 100K Items | Memory | Error | |---------------|:----------:|:------:|:----:| | HashMap | 100K entries | ~5 MB | 0% | | CMS (1% error, 99% confidence) | Any stream | ~5 KB | <= 1% | | CMS (0.1% error, 99.9% confidence) | Any stream | ~50 KB | <= 0.1% |

Real-World Applications

1. Trending Hashtags on Twitter

Twitter streams 500,000+ tweets per minute. CMS tracks hashtag frequencies with ~5 KB memory. Combined with a min-heap, it maintains the top-k trending hashtags in real-time.

2. DDoS Detection

CDN providers monitor request counts per source IP. When estimated requests exceed a threshold (10x normal), the system auto-triggers rate limiting or blocking.

3. E-Commerce Trending Products

E-commerce platforms count product views and purchases in real-time. CMS identifies trending products with minimal memory, feeding data to recommendation systems.

4. Database Query Optimization

Databases estimate GROUP BY and DISTINCT result sizes using CMS before execution, selecting the optimal query strategy.

5. IoT Sensor Data

Edge devices (Raspberry Pi, ESP32) use CMS to count sensor reading distributions locally, uploading only compact sketches to the cloud instead of raw data streams.

Variants

| Variant | Description | |---------|-------------| | Count-Mean-Min Sketch | Subtracts estimated noise from other items for better accuracy | | Conservative Update | Only increments if the counter is currently the minimum across rows | | Heavy Hitters | CMS + min-heap to track top-k most frequent items | | Count-Min Sketch with Aging | Decays older counts for time-windowed frequency analysis |

Comparison with Other Probabilistic Structures

| Structure | Function | Memory Ratio | |-----------|----------|:------------:| | Bloom Filter | Set membership | ~1.2 MB/1M items (1% FP) | | HyperLogLog | Cardinality | ~16 KB (2% error) | | Count-Min Sketch | Frequency | ~5 KB/100K items (1% error) | | MinHash | Set similarity | ~2 KB (5% error) |

Summary

Count-Min Sketch is a powerful tool for real-time frequency estimation in streaming data. Its fixed memory usage and mergeability make it ideal for large-scale distributed systems.

Key takeaways:

Estimates element frequencies using a 2D counter array with d hash functions
Always overestimates (never underestimates) - takes minimum across rows
Error bounded by epsilon x N with confidence 1-delta
Fixed memory: about (e/epsilon) x ln(1/delta) counters
Distributed-friendly: multiple sketches can be merged
Widely used by Twitter, Reddit, Cloudflare, Apache DataSketches

What's Next: FAISS Vector Search

We have covered three probabilistic structures for streaming data. The next chapter shifts to a different problem: vector similarity search using FAISS, the engine behind recommendation systems and semantic search.