High-Performance LLM Inference: Scaling vLLM and Docker for Production

Moving Beyond Basic HuggingFace Wrappers

Running a Large Language Model (LLM) locally for a quick test is easy. Serving that same model to 500 concurrent users without your GPU catching fire is a different beast entirely. Most developers start by wrapping a model in a FastAPI or Flask app. This works for a single user, but performance hits a wall under even modest load because of how these wrappers manage GPU memory and batch requests.

If you’ve ever watched a GPU crash with an “Out of Memory” (OOM) error despite only processing a few short sentences, you’ve seen the KV cache problem firsthand. Traditional serving methods often waste 50% to 80% of GPU memory due to fragmentation. vLLM fixes this with PagedAttention. It manages memory much like an operating system’s virtual memory, allowing for massive throughput gains and snappier responses.

Quick Start (5 Minutes)

Docker is the cleanest way to get a production-ready server running. Ensure you have the NVIDIA Container Toolkit installed so Docker can actually talk to your hardware.
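
If the toolkit is wired up correctly, a plain CUDA container can see your GPUs. A quick sanity check (the CUDA image tag here is just an example; any recent tag will do):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If nvidia-smi prints your GPU table, Docker can reach the hardware.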

1. Pull the vLLM Image

Grab the official image. It includes the CUDA dependencies and optimized kernels needed for high-speed inference.

docker pull vllm/vllm-openai:latest

2. Launch a Llama 3 Model

Run the command below to start a Llama 3 (8B) server. We map port 8000 and mount your local HuggingFace cache so the weights aren’t re-downloaded on every restart. The token is required because meta-llama models are gated; your HuggingFace account must have accepted the Llama 3 license.

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=<your_token>" \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct
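
The first launch downloads roughly 16GB of weights, so give it a few minutes. Once the server is up, the OpenAI-compatible endpoint will list the loaded model:

curl http://localhost:8000/v1/models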

3. Test the API

vLLM mimics the OpenAI API structure, so it integrates with existing tools immediately. Test it with curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain quantum computing in one sentence."}]
    }'
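
Because the interface is OpenAI-compatible, standard request parameters work as well. For example, pass "stream": true to get tokens back as server-sent events instead of waiting for the full response:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": true
    }'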

The Architecture: Why PagedAttention Matters

To understand the speed, look at how LLMs generate text. For every token in a sequence, the model stores attention keys and values in GPU memory, known as the KV cache, so it doesn’t recompute them at each step. Standard backends allocate this cache as one giant, contiguous block sized for the maximum possible sequence length, such as 4,096 tokens.

If your prompt is only 50 tokens, the remaining 4,046 slots are reserved but sit empty. This is internal fragmentation. When dozens of users connect, these wasted blocks pile up until the GPU runs out of VRAM, even if the actual data is tiny.
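
The numbers add up quickly. A back-of-the-envelope calculation for Llama 3 8B, assuming its published config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16):

per-token KV cache = 2 (K and V) × 32 layers × 8 heads × 128 dims × 2 bytes ≈ 128 KB
reserved per request = 128 KB × 4,096 slots ≈ 512 MB

A single over-provisioned request pins half a gigabyte of VRAM while a 50-token prompt actually uses about 6 MB of it.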

Efficient Memory Management

vLLM partitions the KV cache into small, flexible blocks. These don’t need to be next to each other in physical memory. By managing space dynamically, vLLM packs significantly more requests into the same GPU. In my benchmarks on an A100, switching from a naive wrapper to vLLM increased throughput by 24x for long-context tasks.

Advanced Scaling for Heavy Workloads

When a model is too big for one card or the traffic gets heavy, you need to scale horizontally across your hardware.

Tensor Parallelism

A 70B model requires roughly 140GB of VRAM in FP16, which won’t fit on a single A100. Use the --tensor-parallel-size flag to split the model across multiple GPUs.

docker run --gpus all vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4

This command shards each layer’s weight matrices across 4 GPUs, which then work in lockstep as a single high-capacity unit to handle the massive parameter count.
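
One Docker-specific gotcha: the tensor-parallel workers exchange data through shared memory, and the default container allowance is tiny. The vLLM docs recommend running multi-GPU setups with --ipc=host (or a generous --shm-size):

docker run --gpus all --ipc=host vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4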

Quantization (AWQ and GPTQ)

VRAM is expensive. If you are running on consumer hardware like an RTX 3090 (24GB), use 4-bit quantization. Formats like AWQ store weights in 4 bits instead of 16, shrinking them to roughly a quarter of their FP16 size with only a minor drop in accuracy.

docker run --gpus all vllm/vllm-openai:latest \
    --model casperhansen/llama-3-8b-instruct-awq \
    --quantization awq
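
The arithmetic on a 24GB card is simple. FP16 weights cost 2 bytes per parameter; 4-bit weights cost half a byte (plus a small overhead for quantization scales):

FP16:  8B params × 2 bytes   ≈ 16 GB  (weights alone crowd out the KV cache)
AWQ:   8B params × 0.5 bytes ≈ 4 GB   (leaving most of the card for concurrent requests)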

Lessons from the Trenches

Deploying AI isn’t just about running a command; it’s about keeping the service alive under pressure. Here is what I’ve learned from managing production clusters.

Tune Memory Utilization

vLLM defaults to grabbing 90% of your VRAM. While great for throughput, it leaves no room for anything else. If you have monitoring agents or sidecars running, set --gpu-memory-utilization 0.8 to prevent system instability.
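
Applied to the quick-start command, it is one extra flag passed to the server:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.8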

The 70B “Cold Start” Problem

Loading a 140GB model takes time. If you use Kubernetes, your liveness probes will likely kill the container before it finishes loading. Set your initialDelaySeconds to at least 300. I have seen countless restart loops caused by impatient health checks.
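
A minimal sketch of the probe in a Kubernetes pod spec, assuming the container serves on port 8000; vLLM exposes a /health endpoint for exactly this purpose:

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300
  periodSeconds: 10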

Pin Your Versions

Never use :latest in production. vLLM updates frequently, and a new image might change a flag that breaks your deployment script. Use a specific tag like vllm/vllm-openai:v0.4.2 to ensure your environment is reproducible and stable.

Setting up vLLM with Docker provides a robust foundation that rivals the performance of the major hosted providers. It transforms a slow, fragile prototype into a snappy, responsive API ready for real-world users.
