HyperLogLog: Cardinality Estimation at Scale
Imagine you need to count how many unique visitors visited your website today. You have 100 million visits. Storing every unique visitor ID in a HashSet would require gigabytes of memory. HyperLogLog can do it with 16 KB.
Why Cardinality Estimation?
Many business questions involve counting distinct elements:
- How many unique users visited today? (UV)
- How many distinct search queries were made?
- How many unique IPs accessed the server?
- How many distinct products were viewed?
Exact counting requires storing every unique element seen. For large-scale data, this is prohibitively expensive.
What Is HyperLogLog?
HyperLogLog (HLL) is a probabilistic algorithm that estimates the number of distinct elements in a dataset using fixed, minimal memory. It was developed by Flajolet et al. and is one of the most elegant algorithms in computer science.
Key properties:
- Fixed memory: Always uses 2^b bytes regardless of dataset size
- Mergeable: Multiple HLLs can be combined for distributed counting
- Single pass: Only needs one pass through the data
- ~2% error: Acceptable for most analytics use cases
How HyperLogLog Works (The Coin Flip Intuition)
Imagine flipping a fair coin. How many times would you need to flip to see 3 consecutive tails?
- Expected: 2^3 = 8 flips
- How about 10 consecutive tails? Expected: 2^10 = 1024 flips
- How about 20 consecutive tails? Expected: 2^20 = 1,048,576 flips
HyperLogLog uses this principle. Each element is hashed into a binary number. The number of leading zeros in the hash is like counting consecutive tails. If the maximum number of leading zeros seen is 20, the dataset likely has about 2^20 = 1 million distinct elements.
Since a single measurement is unreliable, HLL divides data into m registers (buckets), computes an estimate per register, and takes the harmonic mean.
Implementation
import math
import mmh3
class HyperLogLog:
"""HyperLogLog implementation for cardinality estimation"""
def __init__(self, b: int = 14):
"""
b: Precision parameter (recommended 10-16)
Memory = 2^b bytes
b=10 -> 1 KB, error ~3.25%
b=14 -> 16 KB, error ~0.81%
b=16 -> 64 KB, error ~0.40%
"""
self.b = b
self.m = 1 << b # Number of registers
self.registers = [0] * self.m
self.alpha = self._get_alpha()
def _get_alpha(self) -> float:
"""Calculate bias correction constant"""
if self.m == 16:
return 0.673
elif self.m == 32:
return 0.697
elif self.m == 64:
return 0.709
else:
return 0.7213 / (1 + 1.079 / self.m)
def _hash(self, item: str) -> int:
"""Compute 32-bit hash of the item"""
return mmh3.hash(str(item)) & 0xFFFFFFFF
def _count_leading_zeros(self, x: int) -> int:
"""Count leading zeros in a 32-bit integer"""
if x == 0:
return 32
return (x.bit_length() ^ 31) + 1
def add(self, item: str) -> None:
"""Add an element to the HLL"""
x = self._hash(item)
# First b bits determine the register index
idx = x & (self.m - 1)
# Remaining bits determine leading zero count
w = x >> self.b
# Update register with max leading zeros seen
zeros = self._count_leading_zeros(w) + 1
if zeros > self.registers[idx]:
self.registers[idx] = zeros
def count(self) -> int:
"""Estimate the number of distinct elements"""
# Harmonic mean of 2^-register values
sum_inv = sum(2.0 ** -r for r in self.registers)
estimate = self.alpha * self.m * self.m / sum_inv
# Small range correction
if estimate <= 2.5 * self.m:
zeros = self.registers.count(0)
if zeros:
estimate = self.m * math.log(self.m / zeros)
# Large range correction
elif estimate > (1 << 32) / 30:
estimate = -(1 << 32) * math.log(1 - estimate / (1 << 32))
return int(estimate)
# Test with 1 million elements
hll = HyperLogLog(b=14) # 16 KB
print(f"Registers: {hll.m:,}")
print(f"Memory: {hll.m} bytes = {hll.m/1024:.1f} KB")
for i in range(1_000_000):
hll.add(f"user_{i}_session_{i * 2}")
estimated = hll.count()
print(f"\nEstimated cardinality: {estimated:,}")
print(f"Actual cardinality: 1,000,000")
error = abs(estimated - 1_000_000) / 1_000_000 * 100
print(f"Error: {error:.2f}%")
Memory Comparison
| Algorithm | 1 Billion Items | Memory | Precision | |-----------|:--------------:|:------:|:---------:| | HashSet | 1B IPs | 16 GB | 100% | | HyperLogLog (b=10) | Any size | 1 KB | ~96.75% | | HyperLogLog (b=14) | Any size | 16 KB | ~98.60% | | HyperLogLog (b=16) | Any size | 64 KB | ~99.00% |
HLL Merge for Distributed Counting
def merge_hll(hlls: list) -> HyperLogLog:
"""Merge multiple HLLs for distributed counting"""
merged = HyperLogLog(b=hlls[0].b)
for h in hlls:
for i in range(merged.m):
merged.registers[i] = max(merged.registers[i], h.registers[i])
return merged
# Example: 10 servers each with their own HLL
shard_hlls = []
for shard in range(10):
h = HyperLogLog(b=14)
for i in range(100_000):
h.add(f"shard_{shard}_user_{i}")
shard_hlls.append(h)
total = merge_hll(shard_hlls)
print(f"Total distinct users: {total.count():,}")
Real-World Application: E-Commerce UV During Sales Event
Scenario: Double 11 shopping festival. Need real-time unique visitor count every minute.
Peak traffic: 5 million requests per minute. UV per minute: approximately 2 million.
Option A: Redis Set (exact counting)
- Memory per minute: 2M x ~36 bytes = 72 MB
- Daily data: 72 MB x 1440 minutes = 103 GB
- Impossible to keep all data in memory
Option B: Redis HyperLogLog
- Memory per HLL: 12 KB
- Daily data: 12 KB x 1440 = 17 MB
- Error: approximately 0.81% (completely acceptable)
> PFADD uv:20241111:0930 user_12345
> PFADD uv:20241111:0930 user_67890
> PFCOUNT uv:20241111:0930
(integer) 1845327 # Estimated, ~0.8% error
# Merge hourly data into daily
> PFMERGE uv:20241111 uv:20241111:09 uv:20241111:10 uv:20241111:11
> PFCOUNT uv:20241111
(integer) 8934561 # Daily unique visitors
Result: 16 GB vs 17 MB โ nearly 1000x memory savings!
Advantages
| Advantage | Description | |-----------|-------------| | Fixed memory | Always 2^b bytes regardless of data size | | Mergeable | Combine multiple HLLs without data loss | | Single pass | One scan through data, suitable for streams | | Streaming | Works on infinite data streams | | Parallelizable | Shard data -> each shard builds HLL -> merge |
Use Cases
- Website UV statistics: Billions of visits, 16 KB for real-time counting
- Database query optimization: Estimate SELECT DISTINCT size for query planning
- Network monitoring: Count distinct IP connections in real-time
- Crawler deduplication: Track crawled URLs without storing all URLs
- Clickstream analysis: Count distinct product views and search queries
Summary
HyperLogLog is a brilliant algorithm that estimates distinct element counts using minimal, fixed memory. Its mergeability makes it ideal for distributed systems.
Key takeaways:
- Estimates cardinality using 1-64 KB regardless of dataset size
- Based on counting leading zeros in hash values (coin flip analogy)
- Approximately 2% error, acceptable for most analytics use cases
- Mergeable: ideal for distributed MapReduce-style counting
- Widely used in Redis, PostgreSQL, Google BigQuery, AWS
- 1000x memory savings compared to exact HashSet counting
What's Next: Count-Min Sketch
HyperLogLog answers "How many distinct elements?" The next chapter covers Count-Min Sketch, which answers a different question: "How many times has each element appeared?" Using about 5 KB of memory, it estimates frequencies in streaming data with bounded error.