Quick Start — Get Your Chatbot Running in 5 Minutes
Picture this: it’s late, you have a client demo tomorrow morning, and you need a working chatbot prototype — now. Here’s the fastest path.
First, install the dependency:
pip install openai
Grab your API key from platform.openai.com → API Keys, then write this:
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # or set OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Hello, who are you?"}
    ]
)
print(response.choices[0].message.content)
Run it. You should see a response. One API call, one reply — that’s the core. The demo works. Now let’s make it useful.
Deep Dive: How Conversation Context Really Works
The single biggest mistake with OpenAI chatbots is treating each call as isolated. The API is stateless — it remembers nothing between requests. Sending conversation history every time is your job.
Here’s a minimal but functional chatbot loop:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant specializing in Linux and DevOps."}
]

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history,
        max_tokens=1024,
        temperature=0.7
    )
    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

# Simple interactive loop
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("exit", "quit"):
        break
    reply = chat(user_input)
    print(f"Bot: {reply}\n")
conversation_history grows with every turn. The system message stays at position 0, then user/assistant messages alternate. The model sees the full thread on each call.
Understanding the Three Roles
- system — sets the chatbot’s persona and rules. Define this once at startup.
- user — messages from the human side.
- assistant — the model’s previous replies, appended to maintain multi-turn context.
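Put together, a short thread looks like this (the message contents are placeholders for illustration):

```python
# A two-turn conversation as the API sees it: system prompt first,
# then alternating user/assistant messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant specializing in Linux and DevOps."},
    {"role": "user", "content": "How do I check disk usage?"},
    {"role": "assistant", "content": "Use df -h for filesystems and du -sh <dir> for directories."},
    # The model can only resolve "And" below because the prior turns are included:
    {"role": "user", "content": "And sorted by size?"},
]

roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant', 'user']
```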
Keeping the Token Budget in Check
Every message in history counts toward the context limit. With gpt-4o-mini you have a 128K token window — generous, but long sessions accumulate cost fast. A simple trim strategy prevents unbounded growth:
MAX_HISTORY_MESSAGES = 20  # keep the last 20 messages (10 exchanges), excluding the system prompt

def trim_history():
    system_msg = conversation_history[0]
    recent = conversation_history[1:][-MAX_HISTORY_MESSAGES:]
    conversation_history.clear()
    conversation_history.append(system_msg)
    conversation_history.extend(recent)
Call trim_history() before each API call; one extra line that saves you from a nasty billing surprise.
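Count-based trimming treats a one-line question and a 2,000-word paste the same. If you want to trim by size instead, a rough character-based estimate works as a sketch — the ~4 characters/token ratio and the MAX_PROMPT_TOKENS value below are illustrative assumptions (use tiktoken for exact counts):

```python
# Rough token-budget trim: ~4 chars per token is a crude English-text
# heuristic, good enough for a safety margin, not for billing math.
MAX_PROMPT_TOKENS = 8000

def estimate_tokens(message: dict) -> int:
    return len(message["content"]) // 4 + 4  # +4 for per-message overhead

def trim_to_budget(history: list) -> list:
    """Keep the system prompt plus as many recent messages as fit the budget."""
    system_msg, rest = history[0], history[1:]
    budget = MAX_PROMPT_TOKENS - estimate_tokens(system_msg)
    kept = []
    for msg in reversed(rest):          # walk newest-first
        cost = estimate_tokens(msg)
        if budget - cost < 0:
            break
        budget -= cost
        kept.append(msg)
    return [system_msg] + kept[::-1]    # restore chronological order

history = [{"role": "system", "content": "x" * 40}] + \
          [{"role": "user", "content": "y" * 4000} for _ in range(20)]
print(len(trim_to_budget(history)))  # 8: system prompt + the 7 newest messages that fit
```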
Advanced Usage: Streaming, Error Handling, and Persistence
Streaming Responses for Real-Time Feel
Streaming is what makes a chatbot feel fast, not just be fast. Tokens print as they arrive:
def chat_stream(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history,
        stream=True
    )
    full_response = ""
    print("Bot: ", end="", flush=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        full_response += delta
    print()  # newline after stream ends
    conversation_history.append({"role": "assistant", "content": full_response})
    return full_response
I ran this in production for a customer support interface. The first words appeared almost immediately after the user hit Enter — even on 600-token replies. Users stopped asking “is it loading?”
Retry Logic for Rate Limits and Network Hiccups
When something breaks at 2 AM, it’s usually rate limits or a flaky connection. A simple exponential backoff wrapper:
import time
import openai

def chat_with_retry(user_message: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return chat(user_message)
        except openai.RateLimitError:
            conversation_history.pop()  # drop the user message chat() appended, so the retry doesn't duplicate it
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limit hit. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except openai.APIConnectionError as e:
            conversation_history.pop()
            print(f"Connection error: {e}")
            time.sleep(1)
        except openai.APIStatusError as e:
            print(f"API error {e.status_code}: {e.message}")
            break
    return "Sorry, I'm having trouble connecting right now."
Saving and Restoring Conversation State
Need persistence across sessions? Think support bots that remember context between page reloads. The solution is simpler than most people expect:
import json

def save_history(filepath: str):
    with open(filepath, "w") as f:
        json.dump(conversation_history, f, ensure_ascii=False, indent=2)

def load_history(filepath: str):
    global conversation_history
    try:
        with open(filepath, "r") as f:
            conversation_history = json.load(f)
    except FileNotFoundError:
        pass  # start fresh
No external database needed for single-user deployments. JSON on disk is plenty.
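A quick sanity check of the round trip (the helpers are reproduced here so the snippet runs on its own, and the temp-file path is arbitrary):

```python
import json
import os
import tempfile

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

def save_history(filepath: str):
    with open(filepath, "w") as f:
        json.dump(conversation_history, f, ensure_ascii=False, indent=2)

def load_history(filepath: str):
    global conversation_history
    try:
        with open(filepath, "r") as f:
            conversation_history = json.load(f)
    except FileNotFoundError:
        pass  # start fresh

path = os.path.join(tempfile.gettempdir(), "session.json")
save_history(path)
saved = list(conversation_history)
conversation_history = []      # simulate a fresh process
load_history(path)
print(conversation_history == saved)  # True
os.remove(path)
```

In a real deployment you would call load_history() once at startup and save_history() after each turn (or register it with atexit).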
Practical Tips That Actually Matter in Production
Always Use Environment Variables for API Keys
Never hardcode credentials. Use environment variables or a .env file:
pip install python-dotenv
from dotenv import load_dotenv
import os
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Pick the Right Model for the Job
gpt-4o-mini handles most chatbot use cases well — fast, cheap (~$0.15/1M input tokens), and sharp enough for general Q&A. Save gpt-4o for tasks that need deeper reasoning. A minimal router:
def get_model(task_complexity: str) -> str:
    if task_complexity == "complex":
        return "gpt-4o"
    return "gpt-4o-mini"
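How you decide task_complexity is up to you. One entirely heuristic sketch flags long messages or reasoning-heavy keywords; the marker list below is an assumption to tune against your own traffic:

```python
# Crude router input: long prompts or reasoning keywords go to the bigger model.
COMPLEX_MARKERS = ("debug", "architect", "refactor", "prove", "step by step")

def classify_complexity(user_message: str) -> str:
    text = user_message.lower()
    if len(text) > 500 or any(marker in text for marker in COMPLEX_MARKERS):
        return "complex"
    return "simple"

print(classify_complexity("What's the capital of France?"))      # simple
print(classify_complexity("Refactor this module step by step"))  # complex
```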
Write a Specific System Prompt
Vague prompts get vague results. Be explicit about scope, tone, and format:
SYSTEM_PROMPT = """You are a technical support assistant for Linux server issues.
- Answer only questions related to Linux, shell scripting, and server administration.
- When providing commands, always wrap them in code blocks.
- If you don't know the answer, say so directly — do not guess.
- Keep responses concise unless the user explicitly asks for detail."""
Log Inputs, Outputs, and Token Usage
When production breaks at 3 AM, you’ll want the conversation log and cost data ready. Add logging directly into your chat() function:
import logging

logging.basicConfig(
    filename="chatbot.log",
    level=logging.INFO,
    format="%(asctime)s | %(message)s"
)

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    logging.info(f"USER: {user_message}")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history
    )
    assistant_reply = response.choices[0].message.content
    usage = response.usage
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    logging.info(f"ASSISTANT: {assistant_reply}")
    logging.info(f"TOKENS: prompt={usage.prompt_tokens}, completion={usage.completion_tokens}, total={usage.total_tokens}")
    return assistant_reply
Set a daily alert if cumulative tokens exceed, say, 500K — that’s roughly $0.075 in input costs on gpt-4o-mini, a reasonable sanity check. Month-end billing surprises are far worse than a noisy alert.
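A minimal in-process version of that alert, fed from response.usage after each call. The prices and threshold below are illustrative assumptions; check OpenAI's current pricing page before relying on them:

```python
# Running usage/cost tracker. Call record_usage() with the token counts
# from each response; it returns the cumulative estimated cost in dollars.
INPUT_PRICE_PER_M = 0.15    # assumed gpt-4o-mini $/1M input tokens
OUTPUT_PRICE_PER_M = 0.60   # assumed gpt-4o-mini $/1M output tokens
ALERT_TOKENS = 500_000      # daily sanity-check threshold

totals = {"prompt": 0, "completion": 0}

def record_usage(prompt_tokens: int, completion_tokens: int) -> float:
    totals["prompt"] += prompt_tokens
    totals["completion"] += completion_tokens
    if totals["prompt"] + totals["completion"] > ALERT_TOKENS:
        print("WARNING: daily token budget exceeded")
    return (totals["prompt"] * INPUT_PRICE_PER_M +
            totals["completion"] * OUTPUT_PRICE_PER_M) / 1_000_000

cost = record_usage(prompt_tokens=400_000, completion_tokens=50_000)
print(f"${cost:.4f}")  # $0.0900
```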
You now have a working multi-turn chatbot: streaming, retry logic, persistence, and cost controls. Add a FastAPI layer to expose it as an API, wire it into a Telegram bot, or connect it to a vector database for RAG. The conversation loop stays exactly the same.

