If you’ve spent any time wrangling prompts for a production LLM pipeline, you know the cycle: tweak a sentence, test, tweak again, hope the model reads your intent the way you meant it. I’ve been there — three weeks into a project optimizing a retrieval-augmented QA system, and half that time was prompt archaeology. Then I found DSPy, and the workflow shifted completely.
What follows is a practical breakdown: where DSPy genuinely wins, where it doesn’t, and how to get a working pipeline off the ground.
Manual Prompt Engineering vs DSPy’s Declarative Approach
Traditional prompt engineering works like this: you write a string, embed your task instructions, add a few examples, and hope the model interprets your intent correctly. When accuracy drops — because you swapped models, changed domains, or your data distribution shifted — you go back and rewrite the prompt by hand.
DSPy (Declarative Self-improving Python) takes a different stance. Instead of writing prompts, you define what you want (the input fields, output fields, and constraints) and let DSPy figure out how to phrase the instructions. It compiles your program into optimized prompts using a set of optimizers, historically called teleprompters.
Here’s the key difference in practice:
- Manual approach: You own the prompt string. Model changes break things. Optimization is human-driven and slow.
- DSPy approach: You define a signature (inputs/outputs). DSPy generates and optimizes the prompt automatically based on labeled examples and metric functions.
The analogy that helped it click for me: DSPy is to prompts what an ORM is to SQL. You’re still working with the same underlying system, but the abstraction layer handles the tedious parts so you can focus on what the code is actually doing.
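To make the contrast concrete, here is a toy sketch in plain Python. It is not DSPy's implementation: ToySignature and its render method are hypothetical stand-ins, and they only illustrate why keeping the task description separate from the prompt string turns the prompt into something a program can rewrite.

```python
from dataclasses import dataclass, field


# Manual approach: instructions, format, and data are fused into one string.
def manual_prompt(ticket: str) -> str:
    return (
        "You are a support triage assistant. Classify the ticket as "
        "billing, technical, account, or other.\n"
        f"Ticket: {ticket}\nCategory:"
    )


# Declarative approach (toy stand-in, NOT DSPy's internals): the task is data,
# and a renderer turns that data into a prompt.
@dataclass
class ToySignature:
    instruction: str
    inputs: list[str]
    outputs: list[str]
    demos: list[dict] = field(default_factory=list)

    def render(self, **values: str) -> str:
        lines = [self.instruction, ""]
        # Each demo is a worked example shown before the real input.
        for demo in self.demos:
            for name in self.inputs + self.outputs:
                lines.append(f"{name}: {demo[name]}")
            lines.append("")
        for name in self.inputs:
            lines.append(f"{name}: {values[name]}")
        lines.append(f"{self.outputs[0]}:")
        return "\n".join(lines)


sig = ToySignature(
    instruction="Classify a customer support ticket into a category.",
    inputs=["ticket_text"],
    outputs=["category"],
)
baseline = sig.render(ticket_text="I was charged twice")

# An optimizer could add a demo without touching any calling code.
sig.demos.append({"ticket_text": "App crashes on iOS 17", "category": "technical"})
with_demo = sig.render(ticket_text="I was charged twice")
```

The calling code is identical for baseline and with_demo; only the data behind the signature changed. That is the property the ORM analogy points at.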
Pros and Cons: Honest Assessment After Production Use
What DSPy Gets Right
- Model-agnostic optimization: Swap GPT-4 for Claude or Llama without rewriting your prompts. The signatures stay the same; you re-run the optimizer and DSPy adapts the prompts to the new model.
- Reproducible pipelines: Instead of a folder of markdown prompt files, you have version-controlled Python code with explicit logic.
- Composable modules: Chain dspy.ChainOfThought, dspy.ReAct, and custom modules like Lego blocks. Each module handles its own prompt generation.
- Metric-driven improvement: Define a success metric (exact match, F1, custom scorer), provide a small labeled dataset, and DSPy’s optimizers like BootstrapFewShot or MIPRO search for better prompts automatically.
On a document classification pipeline I shipped to production, accuracy held steady at around 89% F1 through two GPT-4 version bumps. Before DSPy, each model update meant at least a day of manual prompt tuning.
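For the metric-driven point above, a metric is an ordinary Python function. The sketch below shows a token-level F1 scorer written in the (example, prediction, trace=None) shape that DSPy metrics use. The answer field name is an assumption made for illustration, and SimpleNamespace objects stand in for DSPy's example and prediction objects so the block runs without any LLM.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(gold: str, predicted: str) -> float:
    """Harmonic mean of token precision and recall, case-insensitive."""
    gold_tokens = gold.lower().split()
    pred_tokens = predicted.lower().split()
    if not gold_tokens or not pred_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_metric(example, prediction, trace=None):
    # "answer" is a hypothetical field name; use whatever your signature defines.
    return token_f1(example.answer, prediction.answer)


gold = SimpleNamespace(answer="Refund issued for duplicate charge")
guess = SimpleNamespace(answer="refund issued")
partial_score = f1_metric(gold, guess)
```

A graded score like this gives an optimizer more signal than a strict exact-match check when answers are free text. For single-label classification, exact match is usually the simpler choice.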
Where It Falls Short
- Learning curve: DSPy introduces its own abstractions — Signatures, Modules, Teleprompters. Plan for a day or two of friction before the mental model clicks.
- Optimization cost: Running BootstrapFewShotWithRandomSearch makes multiple LLM calls. A single run over 30 training examples on GPT-4 can cost $5–$20 depending on dataset size. Cache aggressively.
- Debugging is harder: When your pipeline misbehaves, the generated prompt is a few layers down. You need to know where to look.
- Small datasets work, tiny datasets don’t: Optimizers need enough labeled examples to find signal. Fewer than 20 examples and results get noisy.
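A rough way to sanity-check the optimization cost before launching a run is a back-of-envelope estimate. The sketch below assumes the optimizer makes about one LLM call per training example per candidate program, which is a simplification rather than how any specific optimizer counts calls. Every number in it, including the per-token prices, is a placeholder and not a quote from any provider.

```python
def estimate_optimizer_cost(
    num_examples: int,
    candidate_programs: int,
    tokens_in_per_call: int,
    tokens_out_per_call: int,
    usd_per_1k_in: float,
    usd_per_1k_out: float,
) -> float:
    """Approximate spend in USD, assuming one call per example per candidate."""
    calls = num_examples * candidate_programs
    cost_in = calls * tokens_in_per_call / 1000 * usd_per_1k_in
    cost_out = calls * tokens_out_per_call / 1000 * usd_per_1k_out
    return round(cost_in + cost_out, 2)


# Placeholder inputs: 30 examples, 10 candidate programs, hypothetical prices.
estimate = estimate_optimizer_cost(
    num_examples=30,
    candidate_programs=10,
    tokens_in_per_call=1500,
    tokens_out_per_call=300,
    usd_per_1k_in=0.03,
    usd_per_1k_out=0.06,
)
```

Plugging in your own token counts and your provider's current prices shows quickly whether a run is a few dollars or a few hundred.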
Recommended Setup
Get your environment clean before touching any optimizer. DSPy works with Python 3.9+ and supports OpenAI, Anthropic, Google, local Ollama, and more out of the box.
# Create a virtual environment
python -m venv venv
source venv/bin/activate
# Install DSPy (published on PyPI as dspy; dspy-ai is the older package name)
pip install -U dspy
# For OpenAI backend
pip install openai
# For Anthropic backend
pip install anthropic
Configure the language model next. DSPy uses a global LM object — set it once at the top of your script and every module picks it up automatically:
import dspy
# OpenAI
lm = dspy.LM('openai/gpt-4o-mini', api_key='sk-...')
# Or Anthropic
# lm = dspy.LM('anthropic/claude-3-haiku-20240307', api_key='sk-ant-...')
# Or local Ollama
# lm = dspy.LM('ollama_chat/llama3', api_base='http://localhost:11434')
dspy.configure(lm=lm)
In production, pull the API key from environment variables. Never hardcode credentials in source files.
import os
lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'])
dspy.configure(lm=lm)
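One small refinement is to fail fast with a readable message when the key is missing, instead of letting a bare KeyError surface at startup. The helper below is a sketch: lm_settings_from_env and the DSPY_MODEL variable are hypothetical names, not part of DSPy.

```python
import os


def lm_settings_from_env(env=os.environ) -> dict:
    """Collect LM settings from the environment, raising a clear error if the key is absent."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the service."
        )
    # DSPY_MODEL is a hypothetical override; the default mirrors the model used above.
    return {"model": env.get("DSPY_MODEL", "openai/gpt-4o-mini"), "api_key": api_key}


# Intended usage (requires dspy and a real key):
#   lm = dspy.LM(**lm_settings_from_env())
#   dspy.configure(lm=lm)
```

Passing the environment in as a parameter keeps the helper easy to test without touching real credentials.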
Implementation Guide: Building an Optimized Pipeline
Step 1 — Define a Signature
A Signature tells DSPy what goes in and what comes out. No prompt text — just field names and optional descriptions:
import dspy
class ClassifySupport(dspy.Signature):
    """Classify a customer support ticket into a category."""

    ticket_text: str = dspy.InputField(desc="Raw text of the customer support ticket")
    category: str = dspy.OutputField(desc="One of: billing, technical, account, other")
    confidence: str = dspy.OutputField(desc="high, medium, or low")
That docstring becomes part of the generated prompt. Keep it precise — vague instructions here produce vague outputs.
Step 2 — Build a Module
Wrap your signature in a module. Use dspy.ChainOfThought when intermediate reasoning steps help accuracy:
class SupportClassifier(dspy.Module):
    def __init__(self):
        super().__init__()  # let dspy.Module set up its own state first
        self.classify = dspy.ChainOfThought(ClassifySupport)

    def forward(self, ticket_text):
        return self.classify(ticket_text=ticket_text)

# Test it without optimization first
classifier = SupportClassifier()
result = classifier(ticket_text="I was charged twice this month. Please refund.")
print(result.category)    # e.g. billing
print(result.confidence)  # e.g. high
Step 3 — Optimize with a Teleprompter
This is where DSPy earns its keep. Prepare a labeled dataset and a metric function, then run an optimizer:
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShot
# Labeled examples
trainset = [
    dspy.Example(ticket_text="Refund request for duplicate charge", category="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="App crashes on iOS 17", category="technical").with_inputs("ticket_text"),
    dspy.Example(ticket_text="How do I reset my password?", category="account").with_inputs("ticket_text"),
    dspy.Example(ticket_text="When does my subscription renew?", category="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="Cannot connect to the API", category="technical").with_inputs("ticket_text"),
    # Add 15-30 examples for reliable results
]

# Metric: exact match on category
def accuracy_metric(example, prediction, trace=None):
    return example.category.lower() == prediction.category.lower()
# Run the optimizer
teleprompter = BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4)
optimized_classifier = teleprompter.compile(SupportClassifier(), trainset=trainset)
# Save the optimized program
optimized_classifier.save("support_classifier_optimized.json")
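After optimization, the compiled program carries the few-shot demonstrations that passed your metric. BootstrapFewShot only selects demonstrations; rewriting the instructions themselves is the job of optimizers such as MIPRO. Load the saved program later without re-running the optimizer: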
classifier = SupportClassifier()
classifier.load("support_classifier_optimized.json")
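Before trusting a compiled program, it helps to score it on examples the optimizer never saw. DSPy ships dspy.evaluate.Evaluate for this. The plain-Python sketch below only illustrates the idea of a held-out split and a scoring loop. keyword_classifier is a hypothetical stand-in for a compiled program, so no LLM is involved, and split_examples and score are helper names invented here.

```python
import random
from types import SimpleNamespace


def split_examples(examples, dev_fraction=0.3, seed=0):
    """Shuffle deterministically and return (trainset, devset)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[cut:], shuffled[:cut]


def accuracy_metric(example, prediction, trace=None):
    return example.category.lower() == prediction.category.lower()


def score(program, devset, metric):
    """Fraction of dev examples on which the metric passes."""
    hits = [bool(metric(ex, program(ticket_text=ex.ticket_text))) for ex in devset]
    return sum(hits) / len(hits)


# Hypothetical stand-in for a compiled DSPy program: a keyword rule.
def keyword_classifier(ticket_text):
    label = "billing" if "charge" in ticket_text.lower() else "technical"
    return SimpleNamespace(category=label)


examples = [
    SimpleNamespace(ticket_text=f"Duplicate charge #{i}", category="billing")
    for i in range(5)
] + [
    SimpleNamespace(ticket_text=f"App crash #{i}", category="technical")
    for i in range(5)
]

trainset, devset = split_examples(examples)
dev_accuracy = score(keyword_classifier, devset, accuracy_metric)
```

Scoring the unoptimized and optimized programs on the same devset gives a number you can compare across model swaps, instead of an impression from a few spot checks.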
Step 4 — Chaining Multiple Modules
Real pipelines usually have more than one step. DSPy composes well:
class SummarizeTicket(dspy.Signature):
    """Summarize a support ticket in one sentence."""

    ticket_text: str = dspy.InputField()
    summary: str = dspy.OutputField()

class FullSupportPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.Predict(SummarizeTicket)
        self.classify = dspy.ChainOfThought(ClassifySupport)

    def forward(self, ticket_text):
        # The classifier sees the one-sentence summary, not the raw ticket
        summary = self.summarize(ticket_text=ticket_text).summary
        classification = self.classify(ticket_text=summary)
        return classification
Debugging Tips
Unexpected output? Inspect the actual prompt DSPy generated before assuming your logic is wrong:
# Run a single call, then print the prompt and response from the last LLM request
with dspy.context(lm=lm):
    result = classifier(ticket_text="My card was declined")
dspy.inspect_history(n=1)  # prints directly and returns None, so no print() wrapper
One thing to know: DSPy caches LLM calls by default. During development this saves money. In production, remember that an identical request returns the stored response rather than a fresh model call, so decide deliberately where caching should stay on.
When to Reach for DSPy
DSPy makes the most sense when:
- Your pipeline has multiple chained LLM calls that are hard to tune independently
- You need to swap underlying models without regression
- You have labeled data (or can generate it) to drive optimization
- Prompt stability across releases matters more than raw throughput
For one-off scripts or simple single-call tasks, the overhead isn’t worth it. A well-crafted manual prompt ships faster. But for anything that’ll be maintained and evolved over months, the declarative approach pays for itself — lower maintenance burden, fewer late-night debugging sessions, and model updates that don’t break your pipeline.
The optimizer runs once per model or training-set change, not on every request. After that, your pipeline is just Python code you can test, version, and deploy like any other service.

