Six Months with Kafka: What I Wish Someone Had Told Me
I was skeptical the first time my team proposed replacing our batch ETL pipeline with Apache Kafka. We were processing about 2 million events per day — enough to feel the pain of hourly delays, but not so much that the operational overhead of a distributed streaming platform felt justified.
Six months later, that skepticism is gone. Kafka didn’t just solve our latency problem. It changed how we think about data flow entirely. Here’s what the journey actually looked like.
Quick Start: Kafka Running in 5 Minutes
Before anything else, get Kafka running locally so you have something real to poke at. Docker Compose is by far the fastest path.
# docker-compose.yml
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
docker-compose up -d
# Create your first topic
docker exec -it <kafka-container> kafka-topics \
--bootstrap-server localhost:9092 \
--create --topic user-events \
--partitions 3 \
--replication-factor 1
# Verify it exists
docker exec -it <kafka-container> kafka-topics \
--bootstrap-server localhost:9092 \
--list
Now send your first message and consume it:
# Producer (terminal 1)
docker exec -it <kafka-container> kafka-console-producer \
--bootstrap-server localhost:9092 \
--topic user-events
# Consumer (terminal 2)
docker exec -it <kafka-container> kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic user-events \
--from-beginning
Type something in terminal 1 and watch it appear in terminal 2 within milliseconds. That’s the core loop of everything Kafka does, at massive scale.
Deep Dive: How Kafka Actually Works
Most tutorials skip the internals entirely — which is why production Kafka surprises people. Here’s the mental model you need before you deploy anything serious.
Topics, Partitions, and Offsets
A topic is just a named log of events. Events are appended sequentially and stored durably on disk. Unlike a message queue, Kafka doesn’t delete messages after consumption. That’s what makes Kafka a log, not just a broker.
Topics are divided into partitions. Each partition is an ordered, immutable sequence of records. With 3 partitions and 3 consumers in a group, each consumer owns exactly one partition. More partitions means more parallelism — that’s how Kafka scales horizontally.
Each message in a partition has an offset — a sequential integer. Consumers track their own offset, so they can replay, skip, or reprocess events at any time. Coming from systems where a consumed message vanished forever, this felt like a superpower the first time I used it.
Producer and Consumer in Python
The confluent-kafka library is what I use in production. It wraps the official C library and benchmarks roughly 3–5× faster than kafka-python for high-throughput workloads — the difference shows up clearly above ~50k messages/sec.
pip install confluent-kafka
# producer.py
from confluent_kafka import Producer
import json
import time

producer = Producer({'bootstrap.servers': 'localhost:9092'})

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}')

for i in range(100):
    event = {
        'user_id': f'user_{i % 10}',
        'action': 'page_view',
        'timestamp': time.time()
    }
    producer.produce(
        'user-events',
        key=event['user_id'],  # Same key always goes to same partition
        value=json.dumps(event).encode('utf-8'),
        callback=delivery_report
    )
    producer.poll(0)  # Serve delivery callbacks

producer.flush()
# consumer.py
from confluent_kafka import Consumer
import json

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-service',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['user-events'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f'Consumer error: {msg.error()}')
            continue
        event = json.loads(msg.value().decode('utf-8'))
        print(f"Processing: user={event['user_id']}, action={event['action']}")
finally:
    consumer.close()
One thing I learned the hard way: always set a meaningful consumer group ID. If two services accidentally share the same group ID, they’ll split the partitions between them and each will miss half the events. Silent data loss is the worst kind of failure — there’s no error, no alert, just missing numbers in your dashboard three days later.
Advanced Usage: Building a Real Pipeline
A single producer and consumer is a demo. Real pipelines have multiple stages — and that’s where Kafka’s design starts to pay off.
Chaining Topics for Multi-Stage Processing
Each topic becomes a stage in your pipeline:
- raw-events — everything coming in from the application
- validated-events — after schema validation and deduplication
- enriched-events — after joining with user profile data
- analytics-aggregates — final computed metrics
Each microservice consumes from one topic and produces to the next. Independent scaling, clean failure boundaries, and the ability to replay any stage without touching the others.
Consumer Groups for Horizontal Scaling
# Scale out by running multiple instances with the same group.id
# Kafka automatically rebalances partitions across active consumers
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-service',  # Same group = load balanced
    'auto.offset.reset': 'latest',
    'enable.auto.commit': False  # Manual commit for at-least-once delivery
})
consumer.subscribe(['validated-events'])

while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        process_event(json.loads(msg.value()))
        consumer.commit(asynchronous=False)  # Commit after successful processing
Kafka Streams for Stateful Processing
Aggregations like “count events per user in the last 5 minutes” require stateful stream processing. On the JVM, Kafka Streams is the natural fit. For Python, Faust handles this cleanly:
pip install faust
import faust

app = faust.App('analytics', broker='kafka://localhost:9092')
events_topic = app.topic('validated-events')

# Windowed table: count events per user in 5-minute windows
user_event_counts = app.Table(
    'user-event-counts',
    default=int,
).tumbling(300.0)  # 5-minute tumbling window

@app.agent(events_topic)
async def process(events):
    # group_by repartitions the stream by user_id; Faust requires a
    # name for the repartition topic when the key is a lambda
    async for event in events.group_by(lambda e: e['user_id'], name='by-user'):
        user_event_counts[event['user_id']] += 1
        print(f"User {event['user_id']}: {user_event_counts[event['user_id']].current()} events")
Practical Tips from Production
Getting Kafka running is the easy part. Keeping it running under load without 3am pages — that’s where things get interesting. The tips below are the ones I learned the expensive way.
Partition Count Is a One-Way Door
You can increase partitions later, but you can’t decrease them. Worse, increasing partitions breaks key-based ordering — messages with the same key may now land on different partitions. Think hard about partition count upfront. A workable rule of thumb: aim for 2–3× your expected peak consumer count. For a medium-traffic topic handling ~10k events/sec, 50 partitions is usually more than enough.
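Why does adding partitions break key ordering? Because the partitioner is just hash-mod-N, so changing N remaps keys. A sketch of the effect — Kafka's default partitioner actually uses murmur2, so crc32 here is only a stand-in to illustrate the behavior, not the real assignment:

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner uses murmur2; crc32 stands in here
    # purely to show the hash-mod-N behavior.
    return zlib.crc32(key) % num_partitions

keys = [f'user_{i}'.encode() for i in range(10)]

before = {k: pick_partition(k, 3) for k in keys}  # 3 partitions
after = {k: pick_partition(k, 6) for k in keys}   # topic grown to 6

moved = [k for k in keys if before[k] != after[k]]
print(f'{len(moved)} of {len(keys)} keys changed partition after the expansion')
```

Any key that moves now has events split across two partitions, and per-key ordering is only guaranteed within one partition. That's the one-way door.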
Monitor Lag, Not Throughput
Throughput tells you how fast you’re processing. Lag tells you how far behind you are. A consumer can show excellent throughput and still be falling behind if the producer is faster. Set up lag alerts — I use Burrow; in a pinch, the built-in kafka-consumer-groups CLI works:
kafka-consumer-groups \
--bootstrap-server localhost:9092 \
--describe \
--group analytics-service
If the LAG column is growing, you need more consumers. You can add up to one consumer per partition before you hit the ceiling.
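The arithmetic behind that LAG column is simple: log end offset minus the group's committed offset, per partition. A small sketch — in a real check you'd fetch these numbers via get_watermark_offsets() and committed() in confluent-kafka; the offsets below are made up for illustration:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: how far the group trails the log end offset.

    Both arguments map partition number -> offset. A partition the
    group has never committed counts as lagging from offset 0.
    """
    return {
        p: max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    }

end = {0: 1500, 1: 1480, 2: 1510}        # log end offsets (illustrative)
committed = {0: 1500, 1: 900, 2: 1505}   # group's committed offsets

lag = consumer_lag(end, committed)
print(lag)                                # partition 1 is 580 behind
print(f'total lag: {sum(lag.values())}')
```

Alert on the trend, not the snapshot: a steady lag of 500 may be fine, but a lag that grows every minute means the group can't keep up.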
Set Retention Based on Your Replay Window
Kafka’s default retention is 7 days. Ask yourself: during a bad incident, how far back might you need to replay? For our hot path, 3 days is plenty. Critical topics get 30 days, configured at the topic level:
kafka-configs \
--bootstrap-server localhost:9092 \
--entity-type topics \
--entity-name user-events \
--alter \
--add-config retention.ms=2592000000 # 30 days in milliseconds
Use Schema Registry in Production
Plain JSON is fine in development. In production, schema drift will eventually corrupt your pipeline — a field renamed upstream silently breaks downstream consumers. Confluent Schema Registry with Avro or Protobuf enforces compatibility checks whenever a new schema version is registered, so a breaking change is rejected before bad data ever reaches a topic. Extra setup, yes. But I’ve watched it catch breaking changes that would have caused data loss on at least three separate occasions.
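What a compatible change looks like in practice: under Avro’s backward-compatibility rules, a new field must carry a default so records written with the old schema can still be read. The schema below is an illustrative sketch of a user-events payload, not an actual production schema — session_id is the newly added field:

```json
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action", "type": "string"},
    {"name": "timestamp", "type": "double"},
    {"name": "session_id", "type": ["null", "string"], "default": null}
  ]
}
```

Renaming action without an alias, by contrast, would be rejected at registration time under BACKWARD compatibility — which is exactly the kind of silent drift described above.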
Don’t Fear Rebalancing, But Handle It
When a consumer joins or leaves a group, Kafka triggers a rebalance. Processing pauses for a few seconds across the entire group. If you’re doing manual offset commits, commit in your on_revoke callback before partitions are reassigned — skip this and you’ll reprocess the same events after every rebalance:
from confluent_kafka import Consumer

def on_revoke(consumer, partitions):
    # Commit before partitions are taken away
    consumer.commit(asynchronous=False)

consumer.subscribe(['user-events'], on_revoke=on_revoke)
Is Kafka Right for Your Use Case?
Kafka shines for high-throughput event streaming, durable logs, and decoupling services at scale. For simple task queues, it’s overkill — Redis or RabbitMQ will serve you better with a fraction of the operational complexity. Kafka also demands real ops work: distributed state, replication factor decisions, retention policies, and consumer lag monitoring all need attention.
After six months: if you’re handling tens of thousands of events per second across multiple services, Kafka is worth every bit of the learning curve. If you’re at a few hundred events per minute, start simpler and migrate when the pain becomes undeniable. Kafka will still be there — and you’ll appreciate it more when you actually need it.

