Bloom Filter: Space-Efficient Set Membership

Imagine you need to check whether an email address is in a list of 10 million spam addresses. Using a HashSet would require approximately 50 MB of memory. A Bloom Filter can do the same job with only 1.2 MB.

Why Bloom Filter?

In many real-world applications, you need to check if an element exists in a set:

Is this URL already crawled?
Is this email address known to be spam?
Has this username been taken?
Is this IP address in the blocklist?

Using a HashSet for millions of entries requires substantial memory. Bloom Filter solves this by trading a tiny false positive rate for massive memory savings.

What Is a Bloom Filter?

A Bloom Filter is a probabilistic data structure that can tell you:

Definitely not in the set (100% accurate - no false negatives)
Possibly in the set (has a false positive rate)

It uses a bit array of size m and k independent hash functions. When adding an element, it sets k bit positions to 1. When checking membership, if any of the k bits is 0, the element is definitely not in the set.

How Bloom Filter Works

Adding an Element

1. Take the element (e.g., "user@email.com")
2. Compute k hash values: h1(element), h2(element), ..., hk(element)
3. For each hash value, calculate position = hash % m
4. Set bit at each position to 1

Checking Membership

1. Compute k hash values for the element
2. For each hash value, check if the bit is 1
3. If any bit is 0 -> element is definitely NOT in the set
4. If all bits are 1 -> element is POSSIBLY in the set

Implementation

import math
import mmh3
from bitarray import bitarray

class BloomFilter:
    """Bloom Filter implementation with configurable false positive rate"""
    
    def __init__(self, n: int, fp_prob: float):
        """
        n: Expected number of elements to store
        fp_prob: Desired false positive probability (e.g., 0.01 = 1%)
        """
        self.fp_prob = fp_prob
        self.size = self._get_size(n, fp_prob)
        self.hash_count = self._get_hash_count(self.size, n)
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)
    
    def _get_size(self, n: int, p: float) -> int:
        """Calculate optimal bit array size"""
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(m)
    
    def _get_hash_count(self, m: int, n: int) -> int:
        """Calculate optimal number of hash functions"""
        k = (m / n) * math.log(2)
        return int(k)
    
    def add(self, item: str) -> None:
        """Add an element to the filter"""
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            self.bit_array[digest] = True
    
    def check(self, item: str) -> bool:
        """
        Check if an element exists in the filter.
        Returns False if definitely not in set.
        Returns True if possibly in set.
        """
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            if not self.bit_array[digest]:
                return False
        return True

# Example usage
bf = BloomFilter(1000000, 0.01)  # 1M items, 1% FP rate
print(f"Bit array: {bf.size:,} bits ({bf.size/8/1024/1024:.2f} MB)")
print(f"Hash functions: {bf.hash_count}")

# Add some elements
for i in range(10000):
    bf.add(f"user_{i}@example.com")

# Test membership
print(f"Existing email: {bf.check('user_123@example.com')}")  # True
print(f"New email: {bf.check('new_user@test.com')}")  # Likely False

# Measure false positive rate
false_positives = sum(
    1 for i in range(10000)
    if bf.check(f"not_in_set_{i}@test.com")
)
print(f"False positive rate: {false_positives/10000:.4%} (target: 1%)")

Memory Comparison

| Data Structure | 1M Items | Memory | Savings | |---------------|:--------:|:------:|:-------:| | HashSet | 1M strings (15 bytes avg) | ~50 MB | - | | Bloom Filter (1% FP) | 1M items | ~1.2 MB | 97.6% | | Bloom Filter (0.1% FP) | 1M items | ~1.8 MB | 96.4% | | Bloom Filter (0.01% FP) | 1M items | ~2.4 MB | 95.2% |

The formula: to cut FP rate by 10x, you add about 2 bits per element.

Time Complexity

| Operation | Complexity | |-----------|:----------:| | add(item) | O(k) where k = number of hash functions | | check(item) | O(k) where k = number of hash functions | | Space | O(m) where m = bit array size |

Both operations are extremely fast since k is typically small (5-15).

Real-World Applications

1. Cache Filtering (Preventing Cache Penetration)

When a web server receives a request for a key, check Bloom Filter first. If the filter says the key does not exist, skip the cache lookup entirely. This prevents cache penetration attacks from overwhelming the database.

2. Web Crawler URL Deduplication

Search engines use Bloom Filter to track visited URLs. With billions of URLs, a Bloom Filter saves terabytes of RAM compared to HashSet. Google's web crawler uses this technique.

3. Spam Detection

Email services maintain Bloom Filters of known spam addresses. If a sender is "definitely not" in the filter, the email is delivered immediately. If "possibly," it triggers deeper inspection.

4. Database Query Optimization

Apache Cassandra uses Bloom Filter at the SSTable level to quickly skip files that do not contain the requested data, avoiding unnecessary disk I/O.

5. Blockchain Light Clients

Ethereum light wallets use Bloom Filter to check if a transaction or log exists in a block without downloading the entire block.

Limitations

| Limitation | Description | Workaround | |------------|-------------|------------| | No deletion | Standard BF cannot remove elements | Use Counting Bloom Filter | | False positives | Inherent trade-off for memory savings | Acceptable in most use cases | | FP rate grows | Adding more elements than expected n increases FP | Overallocate or rebuild | | Cannot enumerate | Cannot list stored elements | Maintain a separate list if needed |

Real-World Engineering Decision

Your API receives 10,000 requests per second and needs to check if each request IP is in a blocklist of 5 million IPs.

Option A: HashSet (100% accurate)

Memory: 5M x ~40 bytes = 200 MB
Query time: O(1) but 200 MB exceeds CPU cache
Latency: ~100 ns per query (RAM access)

Option B: Bloom Filter (1% false positive)

Memory: ~6 MB (saves 97%)
Query time: O(k) but fits entirely in CPU cache
Latency: ~10 ns per query (10x faster!)
Cost: 1% of legitimate IPs flagged (handle with secondary HashSet check)

This is a textbook example of when Bloom Filter is the right choice.

Summary

Bloom Filter is a space-efficient probabilistic data structure for set membership testing. It guarantees no false negatives while allowing configurable false positive rates.

Key takeaways:

Uses a bit array and k hash functions for space-efficient membership testing
No false negatives: if it says "not in set," the element is definitely not there
Configurable false positive rate: balance memory vs accuracy
k hash functions: typical range 5-15 depending on desired accuracy
Widely used in databases, caches, web crawlers, and blockchain
97%+ memory savings compared to HashSet for practical configurations

What's Next: HyperLogLog

Bloom Filter answers "Has this element appeared?" The next chapter covers HyperLogLog, which answers a different question: "How many distinct elements are there?" Using only 16 KB of memory, it can count billions of unique values with approximately 2% error.