Building a Local AI Gateway with LiteLLM: Centralize OpenAI, Anthropic, and Ollama


Direct Integration vs. AI Gateway: Which Path to Choose?

Building an AI-powered app usually starts with a single OpenAI API key. You plug it into your environment variables and everything works. But the moment you want to add Claude for better reasoning or a local Llama 3 instance for privacy, your codebase turns into a messy pile of conflicting SDKs and redundant error handling.

In a direct integration setup, your app talks to each provider individually. If OpenAI hits a rate limit at 2 AM, your service breaks. You are forced to manually code fallback logic for Anthropic or Gemini. Managing costs also becomes a logistical headache as API keys and usage data end up scattered across different platforms.
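
A sketch of what that hand-rolled fallback logic tends to look like (the provider calls are hypothetical stubs standing in for two different SDKs):

```python
# Hypothetical stubs standing in for two separate provider SDKs.
def call_openai(prompt: str) -> str:
    raise TimeoutError("rate limited")  # simulate the 2 AM outage

def call_anthropic(prompt: str) -> str:
    return f"claude: {prompt}"

def ask(prompt: str) -> str:
    # Every service that talks to an LLM directly ends up
    # re-implementing some version of this try/except ladder.
    try:
        return call_openai(prompt)
    except Exception:
        return call_anthropic(prompt)

print(ask("hello"))  # falls through to the Anthropic stub: claude: hello
```

Multiply this by every feature that touches an LLM and the duplication becomes obvious.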

An AI Gateway flips this architecture. Your application communicates with a single internal endpoint using the standard OpenAI format. The gateway sits in the middle, handling routing, retries, and authentication. It acts as a smart traffic controller that translates your request to whichever provider is fastest or cheapest at that moment.

The Real-World Trade-offs of a Proxy-First Architecture

Adding another layer to your tech stack is a big decision. Here is how the pros and cons actually play out in a production environment.

The Advantages

  • Unified API: You write code once using the OpenAI SDK. Whether you are calling GPT-4o or a local model via Ollama, the syntax never changes.
  • Virtual Keys: Stop sharing your master Anthropic key. Instead, generate “Virtual Keys” for specific developers or features with built-in expiration dates.
  • Automatic Failover: If GPT-4o returns a 500 error, LiteLLM instantly reroutes the request to Claude 3.5 Sonnet. Your users won’t even notice the hiccup.
  • Granular Tracking: You get a clear view of token usage. You can see exactly which feature is burning through your budget in real-time.

The Drawbacks

  • Operational Overhead: The gateway is a critical piece of infrastructure. If it goes down, your entire AI capability goes dark. This makes a high-availability deployment non-negotiable.
  • Minor Latency: Routing through a proxy adds roughly 5–15ms of network overhead. Considering an average LLM response takes 1,000ms or more, this delay is practically invisible to the end user.

A Production-Ready Architecture

For a stable setup, run LiteLLM as a Docker container backed by a PostgreSQL database. The database is the “brain” of the operation. It stores your virtual keys, tracks usage limits, and logs every request for debugging. Without it, LiteLLM runs in stateless mode, which is fine for a quick demo but risky for a scaling business.

I have seen this setup save teams significant money. In one project, we cut monthly API costs by 30%. We did this by routing simple classification tasks to a local Llama 3 instance while reserving expensive models like GPT-4o only for complex reasoning tasks.
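
The routing decision behind that saving can be as simple as a lookup table. A minimal sketch (the task labels are invented for illustration; the model names match the aliases configured below):

```python
# Hypothetical task-to-model routing table; labels are illustrative.
ROUTES = {
    "classify": "local-llama",  # cheap local model for simple tasks
    "extract":  "local-llama",
    "reason":   "gpt-4o",       # reserve the expensive model
}

def pick_model(task: str) -> str:
    # Default to the cheap local model; escalate only when needed.
    return ROUTES.get(task, "local-llama")

print(pick_model("classify"))  # local-llama
print(pick_model("reason"))    # gpt-4o
```

Because the gateway exposes both models behind one endpoint, swapping a task between them is a one-word change.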

Step-by-Step Implementation Guide

Let’s build a LiteLLM Proxy that manages OpenAI, Anthropic, and a local Ollama instance simultaneously.

1. Configure the Model Routing

Everything in LiteLLM revolves around its YAML config file. This is where you map your internal model names to actual providers. Create a file named litellm_config.yaml:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: claude-3-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: "os.environ/ANTHROPIC_API_KEY"

  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: "http://host.docker.internal:11434"

router_settings:
  routing_strategy: simple-shuffle

litellm_settings:
  set_verbose: True
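
Conceptually, the proxy resolves each incoming model name against this list. A simplified sketch of that lookup (not LiteLLM's actual internals):

```python
# Simplified mirror of the model_list above.
MODEL_LIST = [
    {"model_name": "gpt-4o",          "model": "openai/gpt-4o"},
    {"model_name": "claude-3-sonnet", "model": "anthropic/claude-3-5-sonnet-20240620"},
    {"model_name": "local-llama",     "model": "ollama/llama3"},
]

def resolve(alias: str) -> str:
    # Return the provider-qualified model behind a public alias.
    for entry in MODEL_LIST:
        if entry["model_name"] == alias:
            return entry["model"]
    raise KeyError(f"no deployment for {alias!r}")

print(resolve("local-llama"))  # ollama/llama3
```

Your application only ever sees the aliases on the left; the providers on the right can change without touching application code.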

2. Launch with Docker Compose

Docker Compose ensures your gateway and database stay synced. Create a docker-compose.yml file to manage the services:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=sk-1234 # admin key; required with a database, change it
      - OPENAI_API_KEY=sk-your-key
      - ANTHROPIC_API_KEY=sk-ant-key
      - DATABASE_URL=postgresql://user:pass@db:5432/litellm
    command: ["--config", "/app/config.yaml"]
    depends_on:
      - db

  db:
    image: postgres:16-alpine
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=litellm
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Run docker compose up -d. Your gateway is now listening for requests at http://localhost:4000.

3. Set Up Smart Fallbacks

You can group multiple providers under a single alias to ensure high availability. If one provider fails, the gateway automatically tries the next one in the list.

model_list:
  - model_name: high-reliability-model
    litellm_params:
      model: openai/gpt-4o
  - model_name: high-reliability-model
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 2

Now your app just requests high-reliability-model. LiteLLM handles the heavy lifting of picking the most responsive provider.
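
The failover behavior can be pictured as iterating over the deployments that share one alias. A toy simulation (stub functions, not LiteLLM internals):

```python
# Two stub "deployments" behind the same alias; the first one is down.
def flaky_openai(prompt: str) -> str:
    raise ConnectionError("500 from upstream")

def steady_claude(prompt: str) -> str:
    return "ok: " + prompt

DEPLOYMENTS = {"high-reliability-model": [flaky_openai, steady_claude]}

def complete(alias: str, prompt: str) -> str:
    # Try each deployment in order; surface the last error only
    # if every deployment fails. (A real router also checks whether
    # an error is retryable before moving on.)
    last_err = None
    for deployment in DEPLOYMENTS[alias]:
        try:
            return deployment(prompt)
        except Exception as err:
            last_err = err
    raise last_err

print(complete("high-reliability-model", "ping"))  # ok: ping
```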

4. Connect Your Application

Testing the gateway is straightforward. Use a standard cURL command, passing your master key (or any virtual key) as a Bearer token, to verify it is routing correctly:

curl --request POST \
  --url http://localhost:4000/chat/completions \
  --header 'Authorization: Bearer sk-1234' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Why use a gateway?"}]
  }'

In Python, you only need to update the base_url. The rest of your OpenAI-compatible code remains untouched:

from openai import OpenAI

client = OpenAI(
    api_key="sk-1234",  # your LiteLLM master key or a virtual key
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="claude-3-sonnet",
    messages=[{"role": "user", "content": "Explain the proxy pattern."}]
)

print(response.choices[0].message.content)

Hardening Your Budget and Monitoring

Once your traffic flows through the gateway, you can access the built-in UI (usually at /ui) to manage spend. I recommend creating separate virtual keys for Development, Staging, and Production environments.

Budgeting is where the gateway really shines. You can set a max_budget of $50 on a Development key. If a developer accidentally triggers a recursive loop that spams the API, the gateway acts as a circuit breaker. It will block all further requests once the limit is hit, preventing a $1,000 surprise on your next credit card statement.
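
Creating such a capped key goes through the proxy's /key/generate endpoint. A sketch using only the standard library (the gateway address, master key, alias, and budget below are example values):

```python
import json
from urllib import request

GATEWAY = "http://localhost:4000"  # your proxy address
MASTER_KEY = "sk-1234"             # example master key

def key_request(alias: str, max_budget: float) -> dict:
    # Payload for LiteLLM's /key/generate endpoint.
    return {"key_alias": alias, "max_budget": max_budget}

def create_virtual_key(alias: str, max_budget: float) -> str:
    # POST the payload to the running gateway and return the new key.
    body = json.dumps(key_request(alias, max_budget)).encode()
    req = request.Request(
        f"{GATEWAY}/key/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["key"]

# create_virtual_key("dev-team", 50.0)  # run against a live gateway
```

Hand the returned key to a developer and the $50 cap travels with it; the master key never leaves your ops team.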

This centralized strategy moves you away from checking three different billing dashboards every morning. Instead, you have one central command center for your entire AI infrastructure. It makes scaling your operations much less stressful when you know exactly where every cent is being spent.
