The Scaling Wall and the ETL Tax
Most of us start our journey with a traditional relational database. Having worked with MySQL, PostgreSQL, and MongoDB across different projects, I've found that each has its own strengths. MySQL is usually my go-to for reliability and ease of use.
However, I've often hit a wall when a project grows to the point where a single MySQL instance can't handle the write load. Traditionally, the answer was sharding: splitting your data across multiple servers. It's a maintenance nightmare that sacrifices cross-shard transactions and turns straightforward queries into a headache.
Then there is the analytical side. When the business team wants real-time reports, running heavy aggregate queries on your production MySQL instance is a recipe for a performance meltdown. To solve this, we usually build ETL (Extract, Transform, Load) pipelines to move data to a separate data warehouse like ClickHouse or Snowflake. This introduces data latency and more moving parts to break.
TiDB changes this dynamic. It is a NewSQL database that is horizontally scalable, supports ACID transactions, and speaks the MySQL protocol. But its real magic lies in HTAP (Hybrid Transactional/Analytical Processing). It allows you to run transactional (OLTP) and analytical (OLAP) workloads on the same cluster without them interfering with each other. I’ve found this to be a game-changer for teams that need real-time insights without the overhead of complex ETL pipelines.
Core Concepts: How TiDB Achieves the Impossible
Before jumping into the Docker setup, it helps to understand the architecture. TiDB isn’t a single binary; it’s a cluster of specialized components working together:
- TiDB Server: The stateless SQL layer. It handles client connections, parses SQL, and optimizes query execution. You can scale this layer out to handle more concurrent connections.
- PD (Placement Driver): The brain of the cluster. It stores metadata, manages data distribution, and handles timestamp allocation for transactions.
- TiKV: The row-based storage engine. This is where your transactional data lives. It uses the Raft consensus algorithm to ensure high availability and data consistency.
- TiFlash: The columnar storage engine. This is what makes TiDB an HTAP database. It provides near-real-time analytical capabilities by replicating data from TiKV into a columnar format optimized for scans and aggregations.
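Once the cluster from the setup below is running, you can see each of these components through a system table rather than taking the architecture diagram on faith. A quick look, assuming the default setup:

-- Every component registered with the cluster: its type
-- (tidb/pd/tikv/tiflash), address, and version
SELECT TYPE, INSTANCE, VERSION
FROM information_schema.cluster_info;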
Because TiDB is compatible with the MySQL 5.7 protocol, your existing applications can usually switch to TiDB just by changing the connection string. No need to rewrite your ORM logic or complex queries.
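A quick way to see this compatibility in action is to ask the server for its version once it is running. TiDB reports a MySQL 5.7-style version string so that drivers and ORMs negotiate the protocol correctly (the exact numbers will vary with your image):

-- Returns a MySQL-compatible string, e.g. "5.7.25-TiDB-v7.x.x"
SELECT VERSION();

-- TiDB-specific builtin with fuller build details
SELECT TIDB_VERSION();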
Setting Up TiDB on Docker
For development and testing, Docker Compose is the quickest way to get a TiDB cluster running. While TiDB's own tiup tool handles everything from local playgrounds (tiup playground) to production deployments, Docker gives us a clean, isolated environment to explore HTAP features.
1. The Docker Compose Configuration
Create a directory for your project and save the following as docker-compose.yml. This setup includes the core components: PD, TiKV, TiDB, and TiFlash.
version: '3.8'
# Components log to stdout (no --log-file flags), so
# `docker-compose logs <service>` shows each component's output.
services:
  pd:
    image: pingcap/pd:latest
    ports:
      - "2379:2379"
    command:
      - --name=pd
      - --data-dir=/data/pd
      - --client-urls=http://0.0.0.0:2379
      - --advertise-client-urls=http://pd:2379
      - --peer-urls=http://0.0.0.0:2380
      - --advertise-peer-urls=http://pd:2380
      - --initial-cluster=pd=http://pd:2380
  tikv:
    image: pingcap/tikv:latest
    depends_on:
      - pd
    command:
      - --addr=0.0.0.0:20160
      - --advertise-addr=tikv:20160
      - --data-dir=/data/tikv
      - --pd-endpoints=http://pd:2379
  tidb:
    image: pingcap/tidb:latest
    ports:
      - "4000:4000"   # MySQL protocol port
      - "10080:10080" # HTTP status port
    depends_on:
      - pd
      - tikv
    command:
      - --store=tikv
      - --path=pd:2379
  tiflash:
    image: pingcap/tiflash:latest
    depends_on:
      - pd
      - tikv
    command:
      - server
      - --config-file=/etc/tiflash/tiflash.toml
    # Note: TiFlash needs a config file that points it at PD
    # (raft.pd_addr = "pd:2379"). Depending on the image version, you
    # may need to mount your own tiflash.toml; check the pingcap/tiflash
    # image docs if the container fails to start.
2. Launching the Cluster
Run the following command to start the services in the background:
docker-compose up -d
It takes a minute or two for all components to initialize and register with the Placement Driver (PD). You can check the status of the containers using docker-compose ps, and since the components log to stdout, docker-compose logs -f tidb will tail a component's output.
Testing HTAP Capabilities
Once the cluster is up, you can connect to it using any MySQL client. I’ll use the standard command-line client here. The default port for TiDB is 4000, and the default user is root with no password.
mysql -h 127.0.0.1 -P 4000 -u root
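Before creating any tables, it is worth a quick sanity check that the storage layer registered with PD. Assuming the compose setup above came up cleanly, this should show your TiKV node in the Up state:

-- Each TiKV store known to PD, with its current state
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME
FROM information_schema.tikv_store_status;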
Creating a Table with TiFlash Replica
To leverage the HTAP features, you need to tell TiDB which tables should be replicated to the columnar storage (TiFlash). Let’s create a sample table and enable TiFlash replication.
CREATE DATABASE analytics_demo;
USE analytics_demo;
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL(10, 2),
    order_date DATETIME
);
-- Add a TiFlash replica for this table
ALTER TABLE orders SET TIFLASH REPLICA 1;
You can check the replication status with this query. The AVAILABLE column flips to 1 once the initial sync completes, and PROGRESS shows how far along the copy is.
SELECT * FROM information_schema.tiflash_replica WHERE table_name = 'orders';
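An empty table will not show much, so seed a handful of rows first. For a meaningful performance comparison you would generate a few million rows with a script, but this is enough to verify the plumbing:

INSERT INTO orders (id, customer_id, amount, order_date) VALUES
    (1, 101, 120.50, '2024-01-15 10:00:00'),
    (2, 102,  75.00, '2024-01-15 14:30:00'),
    (3, 101, 220.10, '2024-01-16 09:15:00'),
    (4, 103,  18.99, '2024-01-16 18:45:00');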
Querying with the Right Engine
TiDB’s optimizer is smart enough to decide whether to use TiKV (row-based) or TiFlash (column-based) depending on your query. Point selects and small transactions go to TiKV. Large aggregations go to TiFlash.
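You can watch the optimizer make this choice with EXPLAIN. On a sufficiently large table, the task column shows cop[tikv] for point lookups and batchCop[tiflash] or mpp[tiflash] for heavy scans (on our tiny demo table it may still favor TiKV for everything):

-- Point lookup: expect a Point_Get or cop[tikv] task
EXPLAIN SELECT * FROM orders WHERE id = 1;

-- Wide aggregation: on a large table, expect a tiflash task
EXPLAIN SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;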
You can force a query to use TiFlash to see the performance difference on large datasets:
-- Force the use of TiFlash for this session
SET @@session.tidb_isolation_read_engines = "tiflash";
SELECT SUM(amount), DATE(order_date)
FROM orders
GROUP BY DATE(order_date);
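Remember to switch the session back afterwards, otherwise every query in the session, including point lookups, is forced onto TiFlash. The value below is the documented default:

-- Restore the default engine list for this session
SET @@session.tidb_isolation_read_engines = "tikv,tiflash,tidb";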
Best Practices from the Field
After experimenting with TiDB in various environments, I’ve gathered a few tips that will save you time during deployment and scaling.
Resource Allocation
TiDB is resource-intensive. In a production environment, avoid co-locating TiKV and TiFlash on the same physical disk: TiKV needs low-latency IO for transactions, while TiFlash needs high throughput for scans. On Docker, ensure you allocate enough memory (at least 4GB for a demo, considerably more for production) or the containers may be OOM-killed and exit unexpectedly.
Monitoring is Not Optional
One thing I love about the TiDB ecosystem is its built-in integration with Prometheus and Grafana. If you use the official tiup deployment method later, it sets these up automatically. For Docker, I recommend adding a Grafana container to your compose file to visualize the PD TSO (Timestamp Oracle) wait times and TiKV storage engine metrics. It’s the only way to truly understand where your bottlenecks are.
Mind the Versioning
The TiDB ecosystem moves fast. Always ensure that your TiDB, TiKV, PD, and TiFlash versions match exactly. In the compose file above, that means pinning all four images to the same explicit release tag instead of latest, which can drift between pulls. Running a TiDB v6.5 server against a TiKV v7.1 node can lead to subtle compatibility bugs that are hard to debug.
Final Thoughts
Moving from a single-node MySQL setup to a distributed NewSQL database like TiDB feels like a massive leap, but the Docker-based approach makes the learning curve much flatter. It solves the two biggest headaches in modern data engineering: horizontal scaling and the OLTP/OLAP divide.
By using TiFlash, you get the benefits of a data warehouse with the simplicity of a single database connection. If your application is outgrowing its current relational database, giving TiDB a spin on Docker is a great way to evaluate if it’s the right fit for your next stage of growth.