HyperLogLog: Cardinality Estimation at Scale

Imagine you need to count how many unique visitors visited your website today. You have 100 million visits. Storing every unique visitor ID in a HashSet would require gigabytes of memory. HyperLogLog can do it with 16 KB.

Why Cardinality Estimation?

Many business questions involve counting distinct elements:

  • How many unique users visited today? (UV)
  • How many distinct search queries were made?
  • How many unique IPs accessed the server?
  • How many distinct products were viewed?

Exact counting requires storing every unique element seen. For large-scale data, this is prohibitively expensive.

What Is HyperLogLog?

HyperLogLog (HLL) is a probabilistic algorithm that estimates the number of distinct elements in a dataset using fixed, minimal memory. It was developed by Flajolet et al. and is one of the most elegant algorithms in computer science.

Key properties:

  • Fixed memory: Always uses 2^b bytes regardless of dataset size
  • Mergeable: Multiple HLLs can be combined for distributed counting
  • Single pass: Only needs one pass through the data
  • ~2% error: Acceptable for most analytics use cases

How HyperLogLog Works (The Coin Flip Intuition)

Imagine flipping a fair coin. How many times would you need to flip to see 3 consecutive tails?

  • Expected: 2^3 = 8 flips
  • How about 10 consecutive tails? Expected: 2^10 = 1024 flips
  • How about 20 consecutive tails? Expected: 2^20 = 1,048,576 flips

HyperLogLog uses this principle. Each element is hashed into a binary number. The number of leading zeros in the hash is like counting consecutive tails. If the maximum number of leading zeros seen is 20, the dataset likely has about 2^20 = 1 million distinct elements.

Since a single measurement is unreliable, HLL divides data into m registers (buckets), computes an estimate per register, and takes the harmonic mean.

Implementation

import math
import mmh3

class HyperLogLog:
    """HyperLogLog implementation for cardinality estimation"""
    
    def __init__(self, b: int = 14):
        """
        b: Precision parameter (recommended 10-16)
        Memory = 2^b bytes
        b=10 -> 1 KB, error ~3.25%
        b=14 -> 16 KB, error ~0.81%
        b=16 -> 64 KB, error ~0.40%
        """
        self.b = b
        self.m = 1 << b  # Number of registers
        self.registers = [0] * self.m
        self.alpha = self._get_alpha()
    
    def _get_alpha(self) -> float:
        """Calculate bias correction constant"""
        if self.m == 16:
            return 0.673
        elif self.m == 32:
            return 0.697
        elif self.m == 64:
            return 0.709
        else:
            return 0.7213 / (1 + 1.079 / self.m)
    
    def _hash(self, item: str) -> int:
        """Compute 32-bit hash of the item"""
        return mmh3.hash(str(item)) & 0xFFFFFFFF
    
    def _count_leading_zeros(self, x: int) -> int:
        """Count leading zeros in a 32-bit integer"""
        if x == 0:
            return 32
        return (x.bit_length() ^ 31) + 1
    
    def add(self, item: str) -> None:
        """Add an element to the HLL"""
        x = self._hash(item)
        # First b bits determine the register index
        idx = x & (self.m - 1)
        # Remaining bits determine leading zero count
        w = x >> self.b
        # Update register with max leading zeros seen
        zeros = self._count_leading_zeros(w) + 1
        if zeros > self.registers[idx]:
            self.registers[idx] = zeros
    
    def count(self) -> int:
        """Estimate the number of distinct elements"""
        # Harmonic mean of 2^-register values
        sum_inv = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / sum_inv
        
        # Small range correction
        if estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        
        # Large range correction
        elif estimate > (1 << 32) / 30:
            estimate = -(1 << 32) * math.log(1 - estimate / (1 << 32))
        
        return int(estimate)

# Test with 1 million elements
hll = HyperLogLog(b=14)  # 16 KB
print(f"Registers: {hll.m:,}")
print(f"Memory: {hll.m} bytes = {hll.m/1024:.1f} KB")

for i in range(1_000_000):
    hll.add(f"user_{i}_session_{i * 2}")

estimated = hll.count()
print(f"\nEstimated cardinality: {estimated:,}")
print(f"Actual cardinality: 1,000,000")
error = abs(estimated - 1_000_000) / 1_000_000 * 100
print(f"Error: {error:.2f}%")

Memory Comparison

| Algorithm | 1 Billion Items | Memory | Precision | |-----------|:--------------:|:------:|:---------:| | HashSet | 1B IPs | 16 GB | 100% | | HyperLogLog (b=10) | Any size | 1 KB | ~96.75% | | HyperLogLog (b=14) | Any size | 16 KB | ~98.60% | | HyperLogLog (b=16) | Any size | 64 KB | ~99.00% |

HLL Merge for Distributed Counting

def merge_hll(hlls: list) -> HyperLogLog:
    """Merge multiple HLLs for distributed counting"""
    merged = HyperLogLog(b=hlls[0].b)
    for h in hlls:
        for i in range(merged.m):
            merged.registers[i] = max(merged.registers[i], h.registers[i])
    return merged

# Example: 10 servers each with their own HLL
shard_hlls = []
for shard in range(10):
    h = HyperLogLog(b=14)
    for i in range(100_000):
        h.add(f"shard_{shard}_user_{i}")
    shard_hlls.append(h)

total = merge_hll(shard_hlls)
print(f"Total distinct users: {total.count():,}")

Real-World Application: E-Commerce UV During Sales Event

Scenario: Double 11 shopping festival. Need real-time unique visitor count every minute.

Peak traffic: 5 million requests per minute. UV per minute: approximately 2 million.

Option A: Redis Set (exact counting)

  • Memory per minute: 2M x ~36 bytes = 72 MB
  • Daily data: 72 MB x 1440 minutes = 103 GB
  • Impossible to keep all data in memory

Option B: Redis HyperLogLog

  • Memory per HLL: 12 KB
  • Daily data: 12 KB x 1440 = 17 MB
  • Error: approximately 0.81% (completely acceptable)
> PFADD uv:20241111:0930 user_12345
> PFADD uv:20241111:0930 user_67890
> PFCOUNT uv:20241111:0930
(integer) 1845327  # Estimated, ~0.8% error

# Merge hourly data into daily
> PFMERGE uv:20241111 uv:20241111:09 uv:20241111:10 uv:20241111:11
> PFCOUNT uv:20241111
(integer) 8934561  # Daily unique visitors

Result: 16 GB vs 17 MB โ€” nearly 1000x memory savings!

Advantages

| Advantage | Description | |-----------|-------------| | Fixed memory | Always 2^b bytes regardless of data size | | Mergeable | Combine multiple HLLs without data loss | | Single pass | One scan through data, suitable for streams | | Streaming | Works on infinite data streams | | Parallelizable | Shard data -> each shard builds HLL -> merge |

Use Cases

  • Website UV statistics: Billions of visits, 16 KB for real-time counting
  • Database query optimization: Estimate SELECT DISTINCT size for query planning
  • Network monitoring: Count distinct IP connections in real-time
  • Crawler deduplication: Track crawled URLs without storing all URLs
  • Clickstream analysis: Count distinct product views and search queries

Summary

HyperLogLog is a brilliant algorithm that estimates distinct element counts using minimal, fixed memory. Its mergeability makes it ideal for distributed systems.

Key takeaways:

  • Estimates cardinality using 1-64 KB regardless of dataset size
  • Based on counting leading zeros in hash values (coin flip analogy)
  • Approximately 2% error, acceptable for most analytics use cases
  • Mergeable: ideal for distributed MapReduce-style counting
  • Widely used in Redis, PostgreSQL, Google BigQuery, AWS
  • 1000x memory savings compared to exact HashSet counting

What's Next: Count-Min Sketch

HyperLogLog answers "How many distinct elements?" The next chapter covers Count-Min Sketch, which answers a different question: "How many times has each element appeared?" Using about 5 KB of memory, it estimates frequencies in streaming data with bounded error.

Member Exclusive Free Tutorial

This chapter is free exclusive content for registered members! Please login or register to unlock immediately.

Login / Register Now