The Headache of Data Desync
I’ve lost count of how many times I’ve had to sync an Elasticsearch index or a Redis cache with a primary production database. Early in my career, I relied on the “Dual Write” pattern. My application would write to the database and then immediately try to update the cache. It works fine in a perfect world, but the real world is messy. If the second write fails due to a network hiccup, you’re left with “data drift”—a silent, painful discrepancy that’s a nightmare to track down.
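The dual-write failure mode is easy to demonstrate. Here's a minimal sketch (the two dicts and the simulated outage are mine, standing in for a real database and cache):

```python
# Dual write: the app updates two stores in sequence. If the second
# write fails, the stores silently diverge ("data drift").
db = {}      # stands in for the primary database
cache = {}   # stands in for Redis / Elasticsearch

def save_user(user_id, name, cache_is_down=False):
    db[user_id] = name           # write 1: committed immediately
    if cache_is_down:
        return False             # write 2 failed -- but write 1 already happened
    cache[user_id] = name
    return True

save_user(1, "Ada")                         # happy path: both stores agree
save_user(1, "Grace", cache_is_down=True)   # network hiccup on the second write

print(db[1], cache[1])  # db says "Grace", cache still says "Ada"
```

There is no transaction spanning both stores, so nothing forces them back into agreement; the drift persists until something re-syncs them.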
I also experimented with polling. I’d set up a cron job to fetch records where updated_at > last_sync_time every 60 seconds. While simple, this approach is a resource hog: on a modest AWS t3.medium instance, I've seen frequent polling queries spike CPU usage by 15-20% even when no data has changed. Worse, polling is blind to hard deletes; if a row is gone, the query simply returns nothing for it, so the deletion never propagates to the rest of your system.
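The delete blindness is worth seeing concretely. A sketch, with the table modeled as a dict of rows and hypothetical integer timestamps:

```python
# Polling: fetch rows changed since the last sync. A hard-deleted row
# simply isn't there anymore, so no query result ever mentions it.
rows = {
    1: {"name": "Ada",   "updated_at": 100},
    2: {"name": "Grace", "updated_at": 150},
}

def poll_changes(last_sync_time):
    # Equivalent of: SELECT * FROM t WHERE updated_at > last_sync_time
    return {k: v for k, v in rows.items() if v["updated_at"] > last_sync_time}

last_sync_time = 120
del rows[1]                      # hard delete between two polls
changed = poll_changes(last_sync_time)

print(changed)  # only row 2 -- the delete of row 1 is invisible
```

Soft deletes (a deleted_at column) work around this, but only if every consumer knows to check the flag.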
Think of Your Database as a Stream, Not a Box
The mistake most of us make is treating the database like a static container. We wait for data to land and then try to pull it back out. But under the hood, every modern database is already screaming its changes to a log file. MySQL uses the Binlog, and PostgreSQL uses the Write-Ahead Log (WAL).
Change Data Capture (CDC) flips the script. Instead of asking the database for data, we listen to these internal logs. Debezium is the heavy lifter here. It’s an open-source tool that plugs into Apache Kafka, converting those raw database bytes into a clean, continuous stream of JSON events. It’s the difference between checking your mail once a day and having a live feed of your front porch.
Setting Up the Plumbing
To build a production-ready CDC pipeline, you need three pillars: the source database, Apache Kafka to act as the message backbone, and Debezium running inside Kafka Connect. For local development, Docker Compose is your best friend. It lets you spin up the entire ecosystem in seconds.
version: '3.8'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:2.4
    ports:
      - 2181:2181
  kafka:
    image: quay.io/debezium/kafka:2.4
    ports:
      - 9092:9092
    depends_on:
      - zookeeper
    environment:
      - ZOOKEEPER_CONNECT=zookeeper:2181
  mysql:
    image: quay.io/debezium/example-mysql:2.4
    ports:
      - 3306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=debezium
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  connect:
    image: quay.io/debezium/connect:2.4
    ports:
      - 8083:8083
    depends_on:
      - kafka
      - mysql
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
      - OFFSET_STORAGE_TOPIC=my_connect_offsets
      - STATUS_STORAGE_TOPIC=my_connect_statuses
Fire this up with docker-compose up -d. This specific MySQL image comes pre-configured with binary logging enabled, saving you a dozen manual configuration steps.
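Before registering a connector, it's worth confirming everything actually came up. Two quick checks (port 8083 as mapped above):

```shell
docker compose ps                 # all four services should show "Up"
curl -s http://localhost:8083/    # Kafka Connect answers with JSON version info
```

Kafka Connect can take 20-30 seconds to start accepting requests, so don't panic if the curl fails on the first try.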
Configuring the Debezium Connector
If you’re pointing Debezium at an existing production MySQL instance, you’ll need to tweak your my.cnf file. Debezium requires ROW level logging to understand exactly what changed in each row. Without this, it only sees the SQL statement, which isn’t enough to reconstruct the data state.
[mysqld]
server-id = 223344
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 7
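After restarting MySQL, you can confirm the settings took effect from a MySQL shell; the expected values are the ones set above:

```sql
SHOW VARIABLES LIKE 'log_bin';           -- expect ON
SHOW VARIABLES LIKE 'binlog_format';     -- expect ROW
SHOW VARIABLES LIKE 'binlog_row_image';  -- expect FULL
```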
With the database ready, we tell Debezium to start watching. We do this by posting a JSON configuration to the Kafka Connect REST API. I prefer using curl for this because it’s easy to version control in a shell script.
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d '{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "production_db",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schemahistory.inventory"
  }
}'
Take note of the topic.prefix. Every topic Debezium creates follows the pattern <topic.prefix>.<database>.<table>, so the customers table in the inventory database streams to production_db.inventory.customers.
Seeing the Stream in Action
Once the connector is live, Debezium performs a “snapshot” of your current data. It then transitions into streaming mode. You can watch this happen in real-time by consuming from the Kafka topic using the built-in console consumer.
docker exec -it kafka /kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic production_db.inventory.customers \
  --from-beginning
Try running an UPDATE query in MySQL. Almost instantly, a JSON object will hit your terminal. The beauty of Debezium is the payload structure: it carries both a before and an after image of the row (before is null for inserts, after is null for deletes). This makes it easy to see exactly which fields changed without querying the database again.
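Because both row images are right there in the payload, computing a field-level diff is just a dictionary comparison. A sketch against a hand-written update event (the field values are illustrative, not captured from a real stream):

```python
# Shape of an update event's payload: "before" and "after" row images
# plus an operation code ("c" = create, "u" = update, "d" = delete).
event_payload = {
    "before": {"id": 1004, "first_name": "Anne", "email": "annek@noanswer.org"},
    "after":  {"id": 1004, "first_name": "Anne Marie", "email": "annek@noanswer.org"},
    "op": "u",
}

def changed_fields(payload):
    before, after = payload["before"], payload["after"]
    if before is None or after is None:
        return {}  # insert or delete: no field-level diff to compute
    return {k: (before[k], after[k]) for k in after if before.get(k) != after.get(k)}

print(changed_fields(event_payload))  # {'first_name': ('Anne', 'Anne Marie')}
```

A downstream consumer can use a diff like this to update only the touched fields in the cache or search index instead of rewriting the whole document.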
Lessons from the Trenches
CDC is powerful, but it isn’t a silver bullet. When you scale to hundreds of tables, Kafka topic management becomes a full-time job. You’ll want to automate topic creation and set strict retention policies. Otherwise, your Kafka disk space will vanish faster than you expect.
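Retention is configured per topic. Using the CLI tools shipped inside the Kafka container, capping the example topic above at seven days would look something like this:

```shell
docker exec -it kafka /kafka/bin/kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --alter --entity-type topics \
  --entity-name production_db.inventory.customers \
  --add-config retention.ms=604800000   # 7 days in milliseconds
```

In a real deployment you'd bake this into whatever automation creates the topics, not run it by hand per table.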
Also, keep an eye on your connector health. I’ve seen connectors fail because of a sudden schema change that Debezium didn’t expect. Always monitor the /status endpoint. If the state hits FAILED, your sync stops, and the lag starts building up. In a high-traffic environment moving 10,000 events per second, even a few minutes of downtime can result in a massive backlog.
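A GET against http://localhost:8083/connectors/inventory-connector/status returns JSON with a state for the connector itself and one per task; a connector can report RUNNING while one of its tasks has FAILED, so a health check has to look at both levels. A small sketch of the check I'd run from a monitor, against a hand-written payload in the shape that endpoint returns:

```python
# A connector is only healthy if the connector AND every task are RUNNING.
def is_healthy(status):
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status.get("tasks", [])]
    return all(s == "RUNNING" for s in states)

sample = {
    "name": "inventory-connector",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [{"id": 0, "state": "FAILED", "worker_id": "connect:8083"}],
}

print(is_healthy(sample))  # False: the connector is up, but its lone task failed
```

Wire a check like this into your alerting, because a FAILED task doesn't heal itself; it needs a restart via the REST API.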
