HyperLogLog: Cardinality Estimation at Scale

Imagine you need to count how many unique visitors visited your website today. You have 100 million visits. Storing every unique visitor ID in a HashSet would require gigabytes of memory. HyperLogLog can do it with 16 KB.

Why Cardinality Estimation?

Many business questions involve counting distinct elements:

How many unique users visited today? (UV)
How many distinct search queries were made?
How many unique IPs accessed the server?
How many distinct products were viewed?

Exact counting requires storing every unique element seen. For large-scale data, this is prohibitively expensive.

What Is HyperLogLog?

HyperLogLog (HLL) is a probabilistic algorithm that estimates the number of distinct elements in a dataset using fixed, minimal memory. It was developed by Flajolet et al. and is one of the most elegant algorithms in computer science.

Key properties:

Fixed memory: Always uses 2^b bytes regardless of dataset size
Mergeable: Multiple HLLs can be combined for distributed counting
Single pass: Only needs one pass through the data
~2% error: Acceptable for most analytics use cases

How HyperLogLog Works (The Coin Flip Intuition)

Imagine flipping a fair coin. How many times would you need to flip to see 3 consecutive tails?

Expected: 2^3 = 8 flips
How about 10 consecutive tails? Expected: 2^10 = 1024 flips
How about 20 consecutive tails? Expected: 2^20 = 1,048,576 flips

HyperLogLog uses this principle. Each element is hashed into a binary number. The number of leading zeros in the hash is like counting consecutive tails. If the maximum number of leading zeros seen is 20, the dataset likely has about 2^20 = 1 million distinct elements.

Since a single measurement is unreliable, HLL divides data into m registers (buckets), computes an estimate per register, and takes the harmonic mean.

Implementation

import math
import mmh3

class HyperLogLog:
    """HyperLogLog implementation for cardinality estimation"""
    
    def __init__(self, b: int = 14):
        """
        b: Precision parameter (recommended 10-16)
        Memory = 2^b bytes
        b=10 -> 1 KB, error ~3.25%
        b=14 -> 16 KB, error ~0.81%
        b=16 -> 64 KB, error ~0.40%
        """
        self.b = b
        self.m = 1 << b  # Number of registers
        self.registers = [0] * self.m
        self.alpha = self._get_alpha()
    
    def _get_alpha(self) -> float:
        """Calculate bias correction constant"""
        if self.m == 16:
            return 0.673
        elif self.m == 32:
            return 0.697
        elif self.m == 64:
            return 0.709
        else:
            return 0.7213 / (1 + 1.079 / self.m)
    
    def _hash(self, item: str) -> int:
        """Compute 32-bit hash of the item"""
        return mmh3.hash(str(item)) & 0xFFFFFFFF
    
    def _count_leading_zeros(self, x: int) -> int:
        """Count leading zeros in a 32-bit integer"""
        if x == 0:
            return 32
        return (x.bit_length() ^ 31) + 1
    
    def add(self, item: str) -> None:
        """Add an element to the HLL"""
        x = self._hash(item)
        # First b bits determine the register index
        idx = x & (self.m - 1)
        # Remaining bits determine leading zero count
        w = x >> self.b
        # Update register with max leading zeros seen
        zeros = self._count_leading_zeros(w) + 1
        if zeros > self.registers[idx]:
            self.registers[idx] = zeros
    
    def count(self) -> int:
        """Estimate the number of distinct elements"""
        # Harmonic mean of 2^-register values
        sum_inv = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / sum_inv
        
        # Small range correction
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        
        # Large range correction
        elif estimate > (1 << 32) / 30:
            estimate = -(1 << 32) * math.log(1 - estimate / (1 << 32))
        
        return int(estimate)

# Test with 1 million elements
hll = HyperLogLog(b=14)  # 16 KB
print(f"Registers: {hll.m:,}")
print(f"Memory: {hll.m} bytes = {hll.m/1024:.1f} KB")

for i in range(1_000_000):
    hll.add(f"user_{i}_session_{i * 2}")

estimated = hll.count()
print(f"\nEstimated cardinality: {estimated:,}")
print(f"Actual cardinality: 1,000,000")
error = abs(estimated - 1_000_000) / 1_000_000 * 100
print(f"Error: {error:.2f}%")

Memory Comparison

| Algorithm | 1 Billion Items | Memory | Precision | |-----------|:--------------:|:------:|:---------:| | HashSet | 1B IPs | 16 GB | 100% | | HyperLogLog (b=10) | Any size | 1 KB | ~96.75% | | HyperLogLog (b=14) | Any size | 16 KB | ~98.60% | | HyperLogLog (b=16) | Any size | 64 KB | ~99.00% |

HLL Merge for Distributed Counting

def merge_hll(hlls: list) -> HyperLogLog:
    """Merge multiple HLLs for distributed counting"""
    merged = HyperLogLog(b=hlls[0].b)
    for h in hlls:
        for i in range(merged.m):
            merged.registers[i] = max(merged.registers[i], h.registers[i])
    return merged

# Example: 10 servers each with their own HLL
shard_hlls = []
for shard in range(10):
    h = HyperLogLog(b=14)
    for i in range(100_000):
        h.add(f"shard_{shard}_user_{i}")
    shard_hlls.append(h)

total = merge_hll(shard_hlls)
print(f"Total distinct users: {total.count():,}")

Real-World Application: E-Commerce UV During Sales Event

Scenario: Double 11 shopping festival. Need real-time unique visitor count every minute.

Peak traffic: 5 million requests per minute. UV per minute: approximately 2 million.

Option A: Redis Set (exact counting)

Memory per minute: 2M x ~36 bytes = 72 MB
Daily data: 72 MB x 1440 minutes = 103 GB
Impossible to keep all data in memory

Option B: Redis HyperLogLog

Memory per HLL: 12 KB
Daily data: 12 KB x 1440 = 17 MB
Error: approximately 0.81% (completely acceptable)

> PFADD uv:20241111:0930 user_12345
> PFADD uv:20241111:0930 user_67890
> PFCOUNT uv:20241111:0930
(integer) 1845327  # Estimated, ~0.8% error

# Merge hourly data into daily
> PFMERGE uv:20241111 uv:20241111:09 uv:20241111:10 uv:20241111:11
> PFCOUNT uv:20241111
(integer) 8934561  # Daily unique visitors

Result: 16 GB vs 17 MB — nearly 1000x memory savings!

Advantages

| Advantage | Description | |-----------|-------------| | Fixed memory | Always 2^b bytes regardless of data size | | Mergeable | Combine multiple HLLs without data loss | | Single pass | One scan through data, suitable for streams | | Streaming | Works on infinite data streams | | Parallelizable | Shard data -> each shard builds HLL -> merge |

Use Cases

Website UV statistics: Billions of visits, 16 KB for real-time counting
Database query optimization: Estimate SELECT DISTINCT size for query planning
Network monitoring: Count distinct IP connections in real-time
Crawler deduplication: Track crawled URLs without storing all URLs
Clickstream analysis: Count distinct product views and search queries

Summary

HyperLogLog is a brilliant algorithm that estimates distinct element counts using minimal, fixed memory. Its mergeability makes it ideal for distributed systems.

Key takeaways:

Estimates cardinality using 1-64 KB regardless of dataset size
Based on counting leading zeros in hash values (coin flip analogy)
Approximately 2% error, acceptable for most analytics use cases
Mergeable: ideal for distributed MapReduce-style counting
Widely used in Redis, PostgreSQL, Google BigQuery, AWS
1000x memory savings compared to exact HashSet counting

What's Next: Count-Min Sketch

HyperLogLog answers "How many distinct elements?" The next chapter covers Count-Min Sketch, which answers a different question: "How many times has each element appeared?" Using about 5 KB of memory, it estimates frequencies in streaming data with bounded error.