Deploying Text Generation Inference (TGI) with Docker for High-Performance LLM Serving

AI tutorial - IT technology blog
AI tutorial - IT technology blog

The Production Bottleneck: When Simple Inference Fails

Deploying a Large Language Model (LLM) like Llama 3 for a personal experiment is easy. You load the model with the Transformers library, wrap it in a quick FastAPI endpoint, and it works. However, my team learned the hard way that this setup doesn’t scale. Six months ago, we launched a RAG-based internal tool, and it collapsed under the weight of just 50 concurrent users.

The symptoms were painful. Latency spiked to over 30 seconds. Timeout errors became the norm. Users stared at blank screens while the server struggled to process a single request, only to dump a massive wall of text all at once. We realized that standard Python-based inference handles production traffic poorly. It lacks the efficiency needed for a responsive user experience.

The Culprit: Why Standard Serving Chokes

To fix the lag, we had to rethink how LLMs process data. Traditional servers use sequential processing or static batching. In a static batch, the server waits for a set number of requests, groups them, and processes them as one matrix operation. This is inefficient. If one user asks for a haiku and another asks for a 500-word essay, the short request is held hostage by the long one. This wastes expensive GPU cycles.

Standard Python implementations also struggle with the Global Interpreter Lock (GIL). This overhead prevents the GPU from reaching full saturation. Without Token Streaming, users can’t see the model’s output as it’s generated, making the perceived wait time feel much longer. We needed Continuous Batching. This technique injects new requests into the batch the moment a token is generated for an existing one, keeping the GPU constantly busy.

Choosing the Right Stack: Why TGI?

We evaluated several specialized engines before choosing our production stack:

  • FastAPI + Transformers: Simple to build but lacks high-concurrency optimizations like PagedAttention.
  • vLLM: Extremely fast and popular for PagedAttention, but it felt less integrated with the Hugging Face ecosystem for specific niche models at the time.
  • Text Generation Inference (TGI): Hugging Face built this specifically for their production API. It’s a powerhouse written in Rust, C++, and Python.

TGI provides native support for Flash Attention, PagedAttention, and quantization methods like AWQ and bitsandbytes. By switching to TGI, we cut our hardware costs by 40% and doubled our total throughput. It handled 200+ concurrent requests on the same hardware that previously failed at 50.

Deploying TGI with Docker: A Step-by-Step Guide

Docker is the most reliable way to ship TGI. It packages complex NVIDIA drivers, CUDA 12.1 kernels, and Rust dependencies into one portable container. This eliminates the “it works on my machine” headache.

1. Hardware Prerequisites

You need an NVIDIA GPU with sufficient VRAM. For a Llama 3 (8B) model, aim for at least 16GB of VRAM (like an RTX 4090 or an A10G). Ensure the NVIDIA Container Toolkit is installed so Docker can talk to your hardware.

# Check if Docker can see your GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

2. Launching the TGI Container

We’ll use the official Hugging Face image to run Llama-3-8B-Instruct. If you are using gated models, have your Hugging Face Hub token ready.

model="meta-llama/Meta-Llama-3-8B-Instruct"
volume=$PWD/data
token="your_hf_token_here"

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model \
    --max-batch-prefill-tokens 2048 \
    --max-total-tokens 4096

Breakdown of the settings:

  • --shm-size 1g: Allocates shared memory for fast GPU-to-GPU communication.
  • --max-total-tokens: Sets the hard limit for input plus output length.
  • --max-batch-prefill-tokens: Prevents Out-of-Memory (OOM) errors by limiting how many tokens are processed during the initial prompt phase.

3. Enabling Token Streaming

TGI shines when using Server-Sent Events (SSE). This allows your UI to display text character-by-character. Here is a Python snippet to consume the stream:

import requests
import json

def stream_llm_response(prompt):
    url = "http://localhost:8080/generate_stream"
    data = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 500, "temperature": 0.7}
    }

    response = requests.post(url, json=data, stream=True)
    
    for line in response.iter_lines():
        if line:
            decoded = line.decode('utf-8')
            if decoded.startswith("data:"):
                json_data = json.loads(decoded[5:])
                print(json_data['token']['text'], end="", flush=True)

stream_llm_response("What is Continuous Batching?")

Production Hardening: Lessons from the Field

Running TGI in a cluster for six months taught us a few critical optimization tricks. If your performance stalls, look at these three areas.

Shrink the Model with Quantization

When VRAM is tight, use quantization. Adding --quantize bitsandbytes-nf4 to your command can drop memory usage significantly. For an 8B model, this can reduce the VRAM footprint from ~15GB to under 6GB. This lets you run larger models on cheaper hardware with almost no loss in logic.

Scale with Tensor Parallelism

Models larger than 13B parameters rarely fit on one consumer GPU. TGI makes multi-GPU setups simple. Use the --num-shard flag (e.g., --num-shard 2) to split the model across two GPUs. TGI handles the complex math of distributing the workload automatically.

Monitor the Metrics

TGI has a built-in /metrics endpoint for Prometheus. Watch tgi_request_queue_size closely. If the queue stays high, your GPU is saturated. This is your signal to spin up another TGI instance behind a load balancer.

Building Better AI Infrastructure

Moving from a Python script to a dedicated engine like TGI is a major step for any AI engineer. Docker makes this infrastructure reproducible and easy to scale. By combining Rust-based speed with Continuous Batching, TGI offers a robust way to serve open-source models at scale. Focus on tuning your max-batch-size and monitoring GPU utilization to get the most out of your hardware.

Share: