The 2:15 AM Wake-Up Call: Why Your Container is Blind
Slack started screaming at 2:15 AM on a Tuesday. We were pushing a new computer vision model to our Ubuntu staging cluster. CI/CD was green, the Docker images were built, and Kubernetes reported a healthy deployment. But the logs told a different story: RuntimeError: Found no NVIDIA driver on your system.
On the host machine, nvidia-smi showed eight A100 GPUs running perfectly on version 535 drivers. Inside the container? Total darkness. The application was trying to crunch heavy tensor operations on a CPU, crawling at 3 iterations per second before eventually timing out. This is the moment you realize that host drivers aren’t enough. You need a bridge. You need the NVIDIA Container Toolkit.
In the field, this is a non-negotiable skill. If you’re building with AI or Deep Learning, knowing how to expose hardware acceleration to an isolated container is the difference between a working product and $30,000 of idle silicon.
The Architecture: Why Docker Can’t See Your GPU
By design, Docker containers are hardware-agnostic. They share the host’s kernel but live in a bubble, isolated from specific hardware drivers. If you try to pass a GPU through using standard Linux device mapping, you’ll likely hit library mismatches. These errors are a nightmare to debug at scale.
The NVIDIA Container Toolkit—formerly nvidia-docker2—solves this by acting as a custom runtime. When you launch a container with the --gpus flag, the toolkit handles three critical tasks:
- It mounts the necessary NVIDIA user-level libraries into the container environment.
- It maps the physical device nodes (like /dev/nvidia0) so the container can access them.
- It validates compatibility between the host’s kernel driver and the container’s CUDA version.
Without it, you’re essentially trying to drive a car from a locked room with no windows. This toolkit provides the steering column and the view of the road.
Step-by-Step: Installing the Toolkit on Ubuntu
Prerequisites are simple: you must have NVIDIA drivers and Docker already installed on your Ubuntu host. If nvidia-smi fails on the host, stop here and fix the drivers first. Once the foundation is solid, follow these steps to bridge the gap.
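A quick way to fail fast on missing prerequisites before touching the toolkit. This is a minimal sketch; the have helper is just local shorthand, not part of any NVIDIA or Docker tooling:

```shell
#!/bin/sh
# Prerequisite check: host drivers and Docker must work before the toolkit goes on.
have() { command -v "$1" >/dev/null 2>&1; }

if have nvidia-smi && nvidia-smi >/dev/null 2>&1; then
    echo "host driver: OK"
else
    echo "host driver: MISSING - fix drivers before installing the toolkit"
fi

if have docker; then
    echo "docker: OK ($(docker --version))"
else
    echo "docker: MISSING"
fi
```

Either MISSING line means you stop and fix the foundation first.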
1. Configure the Package Repository
NVIDIA hosts its own repository. We need to add the GPG key and the list to apt so it knows where to look for the binaries.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
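The sed step in that pipeline does one thing: it injects a signed-by attribute into each deb line so apt verifies packages against the key you just imported. You can watch the transformation on a sample line (the repo path here is illustrative):

```shell
#!/bin/sh
# Demonstrate what the sed rewrite does to a deb repository line.
sample='deb https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /'
echo "$sample" | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
# Prints the same line with [signed-by=...] inserted after "deb"
```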
2. Install the Toolkit
Update your local index and install the package. This is a lightweight installation (roughly 15MB) that won’t overwrite your existing Docker setup; it simply adds the necessary plugins.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
3. Configure the Docker Runtime
The binaries are on your disk, but Docker isn’t using them yet. We use the nvidia-ctk command to automatically patch /etc/docker/daemon.json, making nvidia a recognized runtime.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Don’t skip the restart. I’ve wasted hours debugging “GPU not found” errors only to find the daemon was still running the old config. A quick systemctl restart is your best friend here.
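After the configure step, /etc/docker/daemon.json should contain a runtimes entry pointing at nvidia-container-runtime. A sketch of the typical shape and a grep you can reuse against the real file (written to a temp file here so the check is easy to demonstrate; exact paths on your host may differ):

```shell
#!/bin/sh
# Typical daemon.json shape after `nvidia-ctk runtime configure --runtime=docker`.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF

# The same grep works against the real file: /etc/docker/daemon.json
if grep -q '"nvidia"' "$tmp"; then
    echo "nvidia runtime registered"
fi
rm -f "$tmp"
```

If the grep comes back empty after a restart, the daemon is still on the old config.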
The Verification: Testing the Pipeline
Let’s confirm the fix. We’ll pull a lightweight CUDA image and run nvidia-smi from *inside* the container.
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
If the GPU status table appears in your terminal, you’ve successfully tunneled through the container wall. The --gpus all flag exposes every card. If you’re managing a multi-GPU server and want to isolate workloads, pass specific IDs, keeping the extra inner quotes Docker’s flag parser needs for comma-separated lists: --gpus '"device=0,1"'.
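A small dry-run helper that assembles the device flag for a given selection (GPU_IDS and the echoed command are illustrative; nothing is launched):

```shell
#!/bin/sh
# Build a `docker run` GPU flag for a device list (dry run - echoes only).
GPU_IDS="0,1"   # illustrative selection

if [ "$GPU_IDS" = "all" ]; then
    gpu_flag='--gpus all'
else
    # Comma-separated device lists need inner quotes, or Docker's
    # CSV-style flag parser splits the value at the comma.
    gpu_flag="--gpus '\"device=$GPU_IDS\"'"
fi

echo "docker run --rm --runtime=nvidia $gpu_flag nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi"
```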
Avoid These Common Production Pitfalls
Even with a clean install, production environments can be tricky. Here’s what usually goes wrong:
The CUDA Compatibility Trap
Version mismatches are the most frequent killers. If your host driver is version 470.xx but your Docker image requires CUDA 12.x, it will fail immediately. Newer drivers are backward compatible with older CUDA versions, but the reverse is rarely true. Always check the NVIDIA Compatibility Matrix.
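You can automate a coarse version of this check. The minimum driver majors below are approximate (drawn from NVIDIA’s published tables: CUDA 11.x wants roughly a 450-series driver or newer, CUDA 12.x roughly 525 or newer); always confirm against the current matrix for your exact CUDA version:

```shell
#!/bin/sh
# Rough driver-vs-CUDA sanity check. Minimum driver majors are approximate;
# verify against NVIDIA's current compatibility matrix.
min_driver_for_cuda() {
    case "$1" in
        11) echo 450 ;;
        12) echo 525 ;;
        *)  echo 9999 ;;   # unknown CUDA major: assume incompatible
    esac
}

driver_ok() {  # usage: driver_ok <driver_major> <cuda_major>
    [ "$1" -ge "$(min_driver_for_cuda "$2")" ]
}

# On a real host, get the driver version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
if driver_ok 470 12; then echo "470 + CUDA 12: OK"; else echo "470 + CUDA 12: mismatch"; fi
if driver_ok 535 12; then echo "535 + CUDA 12: OK"; else echo "535 + CUDA 12: mismatch"; fi
```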
Permission Grumbles
Running Docker in rootless mode? The installation changes slightly and requires extra configuration in the toolkit’s config.toml. For most production servers, adding your user to the docker group and using standard mode is the path of least resistance.
The Fabric Manager Requirement
On high-end systems like the DGX or servers using H100 GPUs with NVLink, you must ensure the nvidia-fabricmanager service is active. Without it, multi-GPU communication will fail silently, even if nvidia-smi looks correct.
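A hedged pre-flight check for such systems; check_fabric is a local helper, and the service name nvidia-fabricmanager matches the Ubuntu package name:

```shell
#!/bin/sh
# Verify the fabric manager service is up before launching NVLink workloads.
check_fabric() {
    [ "$(systemctl is-active nvidia-fabricmanager 2>/dev/null)" = "active" ]
}

if check_fabric; then
    echo "fabric manager: active"
else
    echo "fabric manager: not active (only required on NVSwitch/NVLink systems)"
fi
```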
Final Thoughts
Enabling GPU support in Docker isn’t just a matter of running commands. It’s about understanding the handover between the Linux kernel, the NVIDIA driver, and the runtime. Mastering the NVIDIA Container Toolkit allows you to deploy complex ML models across any Ubuntu infrastructure with confidence.
The next time you face a midnight deployment failure, remember: check the host driver, verify the toolkit configuration, and always restart the daemon. These foundational steps keep high-performance systems running smoothly while the rest of the world sleeps.