Context & Why: Your AI Is Giving Garbage Outputs — and It’s Not the Model’s Fault
3:47 AM. The production monitoring dashboard is screaming. An AI-powered customer support bot — one we’d been proud of for months — is responding to user queries with complete nonsense. Users are getting error codes explained as cooking recipes. A complaint about a payment failure is being answered with tips on time management.
The logs show the model is working fine. No API errors. No timeouts. Just wrong outputs. I pulled up the system prompt we wrote four months ago: “You are a helpful assistant. Answer user questions.” That was it. Eight words. That was the entire configuration.
That night taught me more about prompt engineering than any course I’d taken. The model wasn’t broken. Our prompts were.
Prompt engineering is the practice of structuring your input to a language model to get consistent, accurate, and useful outputs. It’s not magic — it’s configuration. And like any configuration, getting it wrong in production will cost you sleep.
Why Prompts Matter More Than You’d Expect
Models like GPT-4, Claude, or Gemini are stateless. Every new request starts from zero — no memory of prior conversations, no awareness of your use case or constraints. The only context they have is what you put in the prompt.
The difference between a prompt that works and one that doesn’t can mean:
- 30% vs 95% accuracy on a classification task
- Responses that respect your output format vs responses you have to parse with regex hacks at 2 AM
- A customer support bot that actually helps vs one that confidently hallucinates policies
These techniques translate directly across providers. Whether you’re working with Claude, GPT-4, or Gemini, the underlying principles hold.
Installation: Your Prompt Testing Environment
Before you can tune prompts, you need a fast feedback loop. Don’t test prompts in production. Don’t test through a chat UI if you’re building an API integration. Build a minimal test harness first.
Install the SDK for whichever provider you’re using:
# Anthropic (Claude)
pip install anthropic
# OpenAI
pip install openai
Here’s a minimal Python script that lets you iterate on prompts in seconds:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key-here")
def test_prompt(system_prompt: str, user_message: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text
# Quick iteration loop
system = """You are a technical support agent for a cloud storage service.
You help users troubleshoot upload failures, sync issues, and access problems.
Always ask one clarifying question before suggesting solutions.
Format responses as numbered steps when giving instructions."""
user_input = "My files aren't syncing"
print(test_prompt(system, user_input))
Run this, read the output, tweak the system prompt, run again. A feedback loop under 10 seconds beats any playground UI for serious prompt development.
For OpenAI, the pattern is nearly identical:
from openai import OpenAI
client = OpenAI(api_key="your-api-key-here")
def test_prompt(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content
Configuration: Prompt Techniques That Actually Work
1. Role Assignment — Give the Model a Specific Identity
Vague prompts produce vague outputs. The fastest fix: give the model a specific identity with specific constraints.
Weak:
You are a helpful assistant.
Strong:
You are a senior Linux systems engineer with 10 years of experience.
You specialize in debugging production incidents on Ubuntu 22.04 and RHEL.
You communicate clearly and directly, assuming the person asking knows basic Linux.
When you don't know something, say so — don't guess.
Keep responses under 300 words unless the complexity requires more.
The second version locks in persona, expertise level, communication style, and response length. Each constraint narrows what the model can produce — which is exactly what you want when you’re running this at scale.
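One way to keep those constraints maintainable is to store them as data rather than as a hand-edited string. A minimal sketch (the helper name is my own, not from any SDK):

```python
def build_role_prompt(persona: str, constraints: list[str]) -> str:
    """Assemble a system prompt from a persona line plus explicit constraints."""
    return "\n".join([persona] + [f"- {c}" for c in constraints])

system = build_role_prompt(
    "You are a senior Linux systems engineer with 10 years of experience.",
    [
        "You specialize in debugging production incidents on Ubuntu 22.04 and RHEL.",
        "When you don't know something, say so — don't guess.",
        "Keep responses under 300 words unless the complexity requires more.",
    ],
)
```

Each constraint becomes one reviewable line, so a version-control diff shows exactly which guard rail changed.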
2. Specify Your Output Format Explicitly
Models will invent structure if you don’t define one. If you need JSON, say you need JSON and show the exact schema. If you need a single-word answer, make that explicit.
system = """Classify the following support ticket into exactly one category.
Return ONLY a JSON object with this exact structure, nothing else:
{"category": "billing|technical|account|other", "confidence": 0-100}
Do not include explanation, preamble, or markdown code fences."""
ticket = "I was charged twice for the same order last Tuesday."
result = test_prompt(system, ticket)
# Expected: {"category": "billing", "confidence": 95}
The phrase “nothing else” isn’t redundant — models have a persistent tendency to add explanatory text you didn’t ask for. Make format constraints explicit and firm.
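Even with firm instructions, the calling code should validate what comes back rather than trust it. A defensive parsing sketch (the function name and error handling are my own, not part of any SDK):

```python
import json

VALID_CATEGORIES = {"billing", "technical", "account", "other"}

def parse_classification(raw: str) -> dict:
    """Extract and validate the JSON object from a model reply,
    tolerating preamble or markdown fences the model may add
    despite instructions."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"no JSON object in model output: {raw!r}")
    data = json.loads(raw[start:end + 1])
    if data.get("category") not in VALID_CATEGORIES:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    if not 0 <= data.get("confidence", -1) <= 100:
        raise ValueError(f"confidence out of range: {data.get('confidence')!r}")
    return data
```

Raising on malformed output, instead of silently defaulting, is what surfaces a drifting prompt in your logs before it corrupts downstream data.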
3. Few-Shot Examples — Show, Don’t Just Describe
Some tasks are easier to show than explain. When the rules are fuzzy or context-dependent, examples do more work than instructions. That’s few-shot prompting.
Extract the main error type from these log lines.
Example 1:
Log: "ConnectionRefusedError: [Errno 111] Connection refused to 10.0.0.5:5432"
Error type: database_connection
Example 2:
Log: "PermissionError: [Errno 13] Permission denied: '/var/log/app.log'"
Error type: filesystem_permission
Example 3:
Log: "TimeoutError: Request to https://api.example.com timed out after 30s"
Error type: external_api_timeout
Now classify:
Log: {log_line}
Error type:
I’ve run this in production on a log classification pipeline handling thousands of entries daily. The results were consistent. Before adding examples, accuracy hovered around 70%. After three well-chosen examples, it jumped to 94% and held there.
Pick examples that cover edge cases, not just the happy path. If your model will see ambiguous inputs, your examples should too.
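Hard-coding examples into the prompt string makes them awkward to swap as new edge cases surface. A sketch of keeping them as data and assembling the prompt at call time (the helper and example labels mirror the template above; names are illustrative):

```python
# (log line, label) pairs — edit this list as new edge cases appear
EXAMPLES = [
    ("ConnectionRefusedError: [Errno 111] Connection refused to 10.0.0.5:5432",
     "database_connection"),
    ("PermissionError: [Errno 13] Permission denied: '/var/log/app.log'",
     "filesystem_permission"),
    ("TimeoutError: Request to https://api.example.com timed out after 30s",
     "external_api_timeout"),
]

def build_few_shot_prompt(log_line: str) -> str:
    """Render the few-shot template with the current example set."""
    parts = ["Extract the main error type from these log lines.", ""]
    for i, (log, label) in enumerate(EXAMPLES, 1):
        parts += [f"Example {i}:", f'Log: "{log}"', f"Error type: {label}", ""]
    parts += ["Now classify:", f'Log: "{log_line}"', "Error type:"]
    return "\n".join(parts)
```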
4. Chain-of-Thought for Complex Reasoning
For tasks that require multi-step reasoning, ask the model to work through the problem before giving an answer. The reasoning step matters — it forces the model to check its work rather than pattern-matching to a quick conclusion.
Determine if this support request should be escalated to a human agent.
Think through each criterion:
1. Is the user expressing frustration or anger?
2. Does the request involve a refund over $500?
3. Has the user mentioned reporting this issue previously?
4. Is there any mention of legal action or regulatory bodies?
If the answer to ANY criterion is yes, output: ESCALATE
Otherwise, output: HANDLE
Then give a one-sentence reason.
Request: {user_message}
Let me check each criterion:
The trailing “Let me check each criterion:” prompts the model to reason before concluding. Don’t skip it — it makes a measurable difference on complex classification tasks.
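Reasoning-first output means the verdict is buried in prose, so the calling code has to extract it. A parsing sketch, assuming the model states its verdict after working through the criteria (so the last occurrence wins over any echoed instructions):

```python
import re

def parse_escalation(raw: str) -> str:
    """Pull the final ESCALATE/HANDLE verdict out of a reasoning trace.
    Takes the last occurrence so the verdict, not echoed instruction
    text, is what gets returned."""
    verdicts = re.findall(r"\b(ESCALATE|HANDLE)\b", raw)
    if not verdicts:
        raise ValueError(f"no verdict in model output: {raw!r}")
    return verdicts[-1]
```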
5. Negative Constraints — What NOT to Do
Telling a model what to avoid is as important as telling it what to do. Explicit negative constraints are your guard rails — they keep the model from confidently wandering into areas that could create legal exposure or just look bad to users.
You are a product documentation assistant.
Rules:
- Answer only questions about our product features and documentation
- Do NOT provide legal advice under any circumstances
- Do NOT make promises about future features or release timelines
- Do NOT discuss competitor products
- If a question is outside your scope, say: "That's outside what I can help with.
Contact [email protected] for further assistance."
- Keep all responses under 200 words
The explicit fallback phrase for out-of-scope questions prevents the model from improvising an answer when it should be deflecting.
Verification & Monitoring: Testing Prompts Like Production Code
A prompt is configuration. Test it like configuration — systematically, with known expected outputs.
Build a Test Suite
test_cases = [
    {
        "input": "How do I reset my password?",
        "should_contain": ["password", "reset", "email"],
        "should_not_contain": ["competitor", "I cannot help"]
    },
    {
        "input": "I want to take legal action against your company",
        "should_contain": ["understand", "[email protected]"],
        "should_not_contain": ["legal advice", "you could sue", "liable"]
    }
]
def evaluate_prompt(system_prompt: str, cases: list) -> dict:
    results = {"passed": 0, "failed": 0, "failures": []}
    for case in cases:
        response = test_prompt(system_prompt, case["input"])
        response_lower = response.lower()
        passed = all(term.lower() in response_lower for term in case["should_contain"])
        passed &= all(term.lower() not in response_lower for term in case.get("should_not_contain", []))
        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "input": case["input"],
                "response": response
            })
    return results
# system_prompt is whichever prompt you're testing (e.g., the
# documentation-assistant prompt from section 5)
results = evaluate_prompt(system_prompt, test_cases)
print(f"Passed: {results['passed']}/{len(test_cases)}")
for failure in results["failures"]:
    print(f"\nFAILED on: {failure['input']}")
    print(f"Response: {failure['response']}")
Log Everything in Production
When something goes wrong — and something will — you need the exact prompt that was sent, not just what the model returned. Log the system prompt version, user input, and full model output together. Store prompt versions as constants with identifiers:
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("support_bot")

PROMPT_V1_2 = {
    "version": "support-bot-v1.2",
    "system": """You are a technical support agent..."""
}

def log_interaction(prompt_version, user_input, model_response):
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "input": user_input,
        "output": model_response
    }
    # Write to your logging system
    logger.info(json.dumps(log_entry))
Watch for Model Drift
LLM providers update models periodically. A prompt that worked reliably with gpt-4-turbo may behave differently with gpt-4o. Run your test suite as part of CI/CD, and schedule it to run weekly against production prompts — not just at deployment time.
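That weekly run can be a thin wrapper around the test suite above. A sketch, where the `evaluate` callable, model list, and pass-rate threshold are all assumptions to be wired into your own suite runner:

```python
def check_drift(evaluate, system_prompt, test_cases, models, min_pass_rate=0.9):
    """Run the same suite against each candidate model and return
    (model, pass_rate) pairs that fall below the threshold.
    `evaluate(model, system_prompt, cases)` is assumed to return a
    dict with a "passed" count, like evaluate_prompt above."""
    regressions = []
    for model in models:
        results = evaluate(model, system_prompt, test_cases)
        rate = results["passed"] / max(len(test_cases), 1)
        if rate < min_pass_rate:
            regressions.append((model, rate))
    return regressions
```

In CI, exit non-zero whenever the returned list is non-empty, so a model update fails the pipeline before it reaches users.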
The incident that cost us three hours and user trust that night? The fix was eleven lines added to a system prompt. Specific constraints, a clear role, an explicit output format, and a fallback instruction for edge cases. Eleven lines that should have been there from day one.
Write, test, iterate. The models aren’t the bottleneck — your prompts are. Fix the prompts, and the outputs follow.

