Defeating Cache Penetration: Why I Switched to Redis Bloom and Cuckoo Filters

Database tutorial - IT technology blog
Database tutorial - IT technology blog

The Day My Database Died: A Real-World Cache Nightmare

Back in 2021, I was running the backend for a mid-sized e-commerce site during a major flash sale. Everything seemed stable until a massive wave of traffic hit. But this wasn’t a rush of eager shoppers. It was a targeted bot attack. These bots weren’t interested in our discounted inventory; they were querying millions of random, non-existent product IDs at a rate of roughly 50,000 requests per second.

Our architecture followed a standard pattern. The app checked Redis, and if it found nothing (a cache miss), it hammered our PostgreSQL database. Since the bots were guessing IDs that never existed, Redis consistently returned null. Every single malicious request bypassed the cache and hit the disk. Within three minutes, database CPU usage pegged at 100%. The site crashed. This is the textbook definition of Cache Penetration.

Why Standard Caching Fails Under Pressure

Caches are high-speed mirrors. They excel at storing what is there. Unfortunately, they are terrible at telling you what isn’t there without eating up your entire budget in RAM.

I tried “negative caching” first. I stored null values for those non-existent IDs in Redis with a 5-minute TTL. It was a disaster. The bots just generated 15 million new unique strings every hour. My Redis memory usage exploded from 2GB to 14GB in a single morning. I wasn’t solving the problem; I was just moving the bottleneck. I needed a way to verify if a key existed before bugging the database, and I needed to do it cheaply.

Probabilistic Filters to the Rescue

To fix this without a massive AWS bill, I turned to probabilistic data structures: Bloom Filters and Cuckoo Filters. These filters don’t store the raw data. Instead, they use compact mathematical fingerprints to tell you if an item is “definitely not there” or “probably there.”

Bloom Filters: The Reliable Veteran

A Bloom Filter uses a bit array and several hash functions. When you add an item, the filter hashes it multiple times and flips specific bits to 1. It’s incredibly lean—you can represent 1 million items in just 1.2MB with a 1% error rate.

  • Definitely No: If you check an ID and even one hash bit is 0, the item is 100% missing. You can kill the request instantly.
  • Maybe Yes: If all bits are 1, the item might be in the database. There’s a tiny chance (a False Positive) that different items shared the same bits.

Cuckoo Filters: The Modern Alternative

Eventually, I moved toward Cuckoo Filters. They use a technique called cuckoo hashing to store fingerprints. The biggest win? Cuckoo Filters support deletions. If you delete a product from your DB, you can actually remove it from the filter. With a standard Bloom Filter, you’d have to wipe the whole thing and rebuild it from scratch to get rid of a single entry.

Bloom vs. Cuckoo: Choosing Your Weapon

Picking the right tool depends entirely on your data’s lifecycle. Here is how I weigh them during architectural reviews:

Feature Bloom Filter Cuckoo Filter
Memory Efficiency Beats Cuckoo when you need <1% false positives. Better for higher error tolerances (>3%).
Deletion Support No. Yes.
Speed Lightning fast (O(k) hash lookups). Fast, but lookups slow down as the filter fills past 80%.
Simplicity Rock solid and well-understood. Slightly more complex implementation.

If your dataset is static—like a list of blocked IP addresses—Bloom is perfect. If you’re constantly adding and removing SKUs, Cuckoo is the clear winner.

Practical Implementation with Redis

Native Redis doesn’t support these out of the box, but Redis Stack includes the RedisBloom module. It handles all the complex math for you. You can fire up a test instance with a single Docker command:

bash
docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest

Deploying a Bloom Filter

I usually reserve a filter with a 1% error rate and a 1-million item capacity to start:

bash
# BF.RESERVE {key} {error_rate} {capacity}
BF.RESERVE product_filter 0.01 1000000

Adding and checking items takes microseconds:

bash
# Add an ID
BF.ADD product_filter "prod_99"

# Check if an ID exists
BF.EXISTS product_filter "prod_99"  # Returns 1 (Found it)
BF.EXISTS product_filter "random_bot_id" # Returns 0 (Block it!)

Deploying a Cuckoo Filter

The syntax for Cuckoo is almost identical, but notice the DEL command:

bash
CF.RESERVE user_filter 1000000
CF.ADD user_filter "user_123"
CF.DEL user_filter "user_123" # This is the superpower.

Pro-Tip for Data Prep

Loading millions of existing IDs into Redis can be tedious. I often have to scrub massive CSV exports from legacy SQL dumps before importing them. When I don’t feel like writing a Python script for a one-off task, I use toolcraft.app/en/tools/data/csv-to-json. It processes everything in the browser. It’s a quick way to format data for your Redis population scripts without worrying about your data hitting a third-party server.

Building Your Best Defense

Logic is key. In your backend—whether it’s Node, Python, or Go—treat the filter as your security guard. Don’t let the request pass the front door unless the filter gives the green light:

python
def get_product(product_id):
    # 1. The Security Guard: Check Bloom Filter
    if not redis.execute_command('BF.EXISTS', 'product_filter', product_id):
        return None # Stop the attack here!

    # 2. The Express Lane: Check standard Redis cache
    product = redis.get(f"product:{product_id}")
    if product:
        return product

    # 3. The Warehouse: Query Database as a last resort
    product = db.query("SELECT * FROM products WHERE id = %s", (product_id,))
    if product:
        redis.set(f"product:{product_id}", product)
    return product

After implementing this, our database load dropped by 92% during the next bot wave. It’s a simple architectural shift that pays massive dividends in stability. Protecting your system isn’t always about throwing more RAM at the problem. Sometimes, you just need a better data structure.

Share: