Stop Polling Your Database: A Practical Guide to CDC with Debezium and Kafka

Database tutorial - IT technology blog

The Headache of Data Desync

I’ve lost count of how many times I’ve had to sync an Elasticsearch index or a Redis cache with a primary production database. Early in my career, I relied on the “Dual Write” pattern. My application would write to the database and then immediately try to update the cache. It works fine in a perfect world, but the real world is messy. If the second write fails due to a network hiccup, you’re left with “data drift”—a silent, painful discrepancy that’s a nightmare to track down.

I also experimented with polling. I’d set up a cron job to fetch records where updated_at > last_sync_time every 60 seconds. While simple, this approach is a resource hog. On a modest AWS t3.medium instance, frequent polling queries can spike CPU usage by 15-20% even when no data has changed. Plus, polling is blind to hard deletes; if a row is gone, your query won’t find it to tell the rest of your system.
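As a sketch of what that polling job boiled down to (the table name, column, and timestamp handling here are illustrative, not from a real system):

```shell
# Hypothetical polling job: pull every row touched since the last run.
# The orders table and the hard-coded watermark are assumptions for illustration;
# a real job would read LAST_SYNC from a state file and write it back afterwards.
LAST_SYNC='2024-01-01 00:00:00'
QUERY="SELECT * FROM orders WHERE updated_at > '${LAST_SYNC}'"
echo "$QUERY"
# mysql -h db-host -u sync_user inventory -e "$QUERY"   # then push results downstream
```

This query fires every minute whether or not anything changed, and a DELETE leaves no updated_at trail behind at all — exactly the blind spot described above.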

Think of Your Database as a Stream, Not a Box

The mistake most of us make is treating the database like a static container. We wait for data to land and then try to pull it back out. But under the hood, every modern database is already screaming its changes to a log file. MySQL uses the Binlog, and PostgreSQL uses the Write-Ahead Log (WAL).

Change Data Capture (CDC) flips the script. Instead of asking the database for data, we listen to these internal logs. Debezium is the heavy lifter here. It’s an open-source set of Kafka Connect connectors that tails those logs and turns the raw change records into a clean, continuous stream of JSON events in Apache Kafka. It’s the difference between checking your mail once a day and having a live feed of your front porch.

Setting Up the Plumbing

To build a production-ready CDC pipeline, you need three pillars: the source database, Apache Kafka to act as the message backbone, and Debezium running inside Kafka Connect. For local development, Docker Compose is your best friend. It lets you spin up the entire ecosystem in seconds.

version: '3.8'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:2.4
    ports:
     - 2181:2181

  kafka:
    image: quay.io/debezium/kafka:2.4
    ports:
     - 9092:9092
    depends_on:
     - zookeeper
    environment:
     - ZOOKEEPER_CONNECT=zookeeper:2181

  mysql:
    image: quay.io/debezium/example-mysql:2.4
    ports:
     - 3306:3306
    environment:
     - MYSQL_ROOT_PASSWORD=debezium
     - MYSQL_USER=mysqluser
     - MYSQL_PASSWORD=mysqlpw

  connect:
    image: quay.io/debezium/connect:2.4
    ports:
     - 8083:8083
    depends_on:
     - kafka
     - mysql
    environment:
     - BOOTSTRAP_SERVERS=kafka:9092
     - GROUP_ID=1
     - CONFIG_STORAGE_TOPIC=my_connect_configs
     - OFFSET_STORAGE_TOPIC=my_connect_offsets
     - STATUS_STORAGE_TOPIC=my_connect_statuses

Fire this up with docker-compose up -d. This specific MySQL image comes pre-configured with binary logging enabled, saving you a dozen manual configuration steps.

Configuring the Debezium Connector

If you’re pointing Debezium at an existing production MySQL instance, you’ll need to tweak your my.cnf file. Debezium requires row-level binary logging (binlog_format = ROW) to see exactly what changed in each row. With statement-based logging, it only sees the SQL statement that ran, which isn’t enough to reconstruct the data state.

[mysqld]
server-id         = 223344
log_bin           = mysql-bin
binlog_format     = ROW
binlog_row_image  = FULL
expire_logs_days  = 7
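Before registering the connector, it’s worth confirming the server actually reports these values. The check below runs against canned SHOW VARIABLES output as a sketch; against a live server you’d pipe mysql -N -B -e "SHOW VARIABLES LIKE 'binlog%'" into the same filter.

```shell
# Canned SHOW VARIABLES output (name/value pairs), standing in for the
# live mysql query above.
VARS='binlog_format ROW
binlog_row_image FULL'

# Extract the two values Debezium cares about.
FORMAT=$(echo "$VARS" | awk '$1 == "binlog_format" {print $2}')
IMAGE=$(echo "$VARS" | awk '$1 == "binlog_row_image" {print $2}')

[ "$FORMAT" = "ROW" ]  && echo "binlog_format OK"
[ "$IMAGE" = "FULL" ]  && echo "binlog_row_image OK"
```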

With the database ready, we tell Debezium to start watching. We do this by posting a JSON configuration to the Kafka Connect REST API. I prefer using curl for this because it’s easy to version control in a shell script.

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d '{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "production_db",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schemahistory.inventory"
  }
}'

Take note of the topic.prefix. This string will be the start of every Kafka topic name. For example, a table named customers will stream data to production_db.inventory.customers.
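The naming scheme is mechanical, so you can derive every topic name up front and automate checks against it:

```shell
# Debezium names topics as <topic.prefix>.<database>.<table>.
PREFIX='production_db'
DB='inventory'
TABLE='customers'
TOPIC="${PREFIX}.${DB}.${TABLE}"
echo "$TOPIC"   # production_db.inventory.customers
# Once events are flowing, confirm the topic exists (run inside the kafka container):
# /kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```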

Seeing the Stream in Action

Once the connector is live, Debezium performs a “snapshot” of your current data. It then transitions into streaming mode. You can watch this happen in real-time by consuming from the Kafka topic using the built-in console consumer.

docker-compose exec kafka /kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic production_db.inventory.customers \
    --from-beginning

Try running an UPDATE query in MySQL. Almost instantly, a JSON object will hit your terminal. The beauty of Debezium is the payload structure. It provides a before and after snapshot of the row. This makes it incredibly easy to see exactly which fields changed without having to query the database again.
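Trimmed down to its essentials, an UPDATE event payload looks roughly like this (the field values here are illustrative, and the source block carries more metadata than shown):

```json
{
  "payload": {
    "before": { "id": 1001, "first_name": "Sally", "email": "sally@example.com" },
    "after":  { "id": 1001, "first_name": "Sally", "email": "sally.t@example.com" },
    "source": { "connector": "mysql", "db": "inventory", "table": "customers" },
    "op": "u",
    "ts_ms": 1700000000000
  }
}
```

The op field tells you the operation type: "c" for create, "u" for update, "d" for delete, and "r" for rows read during the initial snapshot — which means hard deletes finally show up as events instead of silently vanishing.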

Lessons from the Trenches

CDC is powerful, but it isn’t a silver bullet. When you scale to hundreds of tables, Kafka topic management becomes a full-time job. You’ll want to automate topic creation and set strict retention policies. Otherwise, your Kafka disk space will vanish faster than you expect.
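As a sketch (the topic name and container paths match the setup above), a per-topic retention override keeps any single CDC topic from quietly eating the disk:

```shell
# Seven days of retention, expressed in milliseconds for Kafka's
# retention.ms topic setting.
RETENTION_MS=$((7 * 24 * 60 * 60 * 1000))
echo "$RETENTION_MS"   # 604800000
# Apply it to one CDC topic (run inside the kafka container):
# /kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 \
#   --alter --entity-type topics \
#   --entity-name production_db.inventory.customers \
#   --add-config retention.ms=604800000
```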

Also, keep an eye on your connector health. I’ve seen connectors fail because of a sudden schema change that Debezium didn’t expect. Always monitor the /status endpoint. If the state hits FAILED, your sync stops, and the lag starts building up. In a high-traffic environment moving 10,000 events per second, even a few minutes of downtime can result in a massive backlog.
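A minimal health check is just a matter of parsing that endpoint’s JSON. The snippet below runs against a canned response as a sketch; in practice you’d feed it curl -s http://localhost:8083/connectors/inventory-connector/status instead.

```shell
# Canned /status response, trimmed to the fields that matter here.
STATUS_JSON='{"name":"inventory-connector","connector":{"state":"RUNNING"},"tasks":[{"id":0,"state":"RUNNING"}]}'

# Pull out the connector-level state (the first "state" field in the payload).
STATE=$(echo "$STATUS_JSON" | grep -o '"state":"[A-Z]*"' | head -n 1 | cut -d'"' -f4)

if [ "$STATE" = "RUNNING" ]; then
  echo "connector healthy"
else
  echo "ALERT: connector state is ${STATE:-UNKNOWN}" >&2
fi
```

Wire this into cron or your alerting stack. A FAILED connector often just needs a POST to /connectors/inventory-connector/restart once the offending schema change has been dealt with.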
