The Trap of “Vibe-Based” Prompt Engineering
Most developers start building AI apps by tweaking prompts in a playground. You change a sentence, hit “Run,” and if the output looks decent, you ship it. This workflow holds up right until it doesn’t: one day you change a single word to fix an edge case, only to discover you’ve broken three other features in the process. This is a prompt regression, and it’s the fastest way to lose user trust.
I’ve been there—staring at a spreadsheet of 50 model responses for four hours, trying to spot why the tone shifted. It is soul-crushing work. Promptfoo fixes this by treating prompts like code. It replaces manual audits with automated unit tests and benchmarks, turning guesswork into data.
Quick Start (5 min)
Promptfoo is a CLI tool designed to run your prompts against a set of test cases. It evaluates the results automatically so you don’t have to. Let’s get it running.
1. Installation
You can run Promptfoo via npx or install it globally to make it available anywhere:
npm install -g promptfoo
2. Initialize your project
Create a dedicated folder for your evaluations and scaffold the configuration:
mkdir prompt-tests && cd prompt-tests
promptfoo init
This command generates a promptfooconfig.yaml file. This file acts as your mission control, defining which models to test and what success looks like.
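As a rough sketch (the values here are illustrative, not the exact file that init writes), a minimal promptfooconfig.yaml ties prompts, providers, and tests together like this:

```yaml
# promptfooconfig.yaml -- illustrative sketch, not the exact scaffolded file
description: "My first eval"
prompts:
  - "Answer the following question concisely: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: icontains
        value: "Paris"
```
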
3. Run your first evaluation
The init command includes a placeholder test to get you started. Ensure your API key is exported in your terminal session:
export OPENAI_API_KEY=your_api_key_here
promptfoo eval
Once the run finishes, spin up the local web UI to visualize the results:
promptfoo view
Deep Dive: Building a Real Test Suite
Basics are fine, but the real power comes from testing complex logic. Imagine you are building a “Support Ticket Classifier” for a company handling 500 requests a day.
Step 1: Define the Prompts
You can compare multiple prompt versions simultaneously. Create prompts.txt to hold your instructions:
Classify the following support ticket into one of these categories: Technical, Billing, General.
Ticket: {{ticket_body}}
Category:
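Promptfoo can hold several prompt variants in a single .txt file, separated by a line containing only ---, and every variant is run against every test case. A second, more constrained variant (wording here is just an example) might look like:

```text
Classify the following support ticket into one of these categories: Technical, Billing, General.
Ticket: {{ticket_body}}
Category:
---
You are a support triage bot. Respond with exactly one word: Technical, Billing, or General.
Ticket: {{ticket_body}}
```
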
Step 2: Configure the Providers
Maybe you want to see if GPT-4o-mini ($0.15/1M tokens) is accurate enough, or if you need the power of GPT-4o ($5.00/1M tokens). Update your promptfooconfig.yaml:
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
prompts:
  - file://prompts.txt
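For reproducible comparisons, you can pin sampling settings per provider using the expanded id/config form (a sketch; temperature 0 keeps runs as deterministic as the models allow):

```yaml
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
  - id: openai:gpt-4o
    config:
      temperature: 0
```
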
Step 3: Add Test Cases and Assertions
Assertions are where Promptfoo outshines a spreadsheet. Instead of eyeballing text, you define hard rules for success.
tests:
  - vars:
      ticket_body: "I can't log into my account even after resetting my password."
    assert:
      - type: icontains
        value: Technical
  - vars:
      ticket_body: "I was charged twice for my subscription this month."
    assert:
      - type: icontains
        value: Billing
  - vars:
      ticket_body: "Hello, I just wanted to say thanks for the great service!"
    assert:
      - type: llm-rubric
        value: "The output should be 'General' and should not contain any aggressive language."
Think of llm-rubric as hiring a smarter model to grade a cheaper one. It uses an LLM call to judge qualitative criteria like tone, sentiment, or accuracy. It’s perfect for checks that regex simply can’t handle.
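By contrast, a deterministic check like icontains boils down to a case-insensitive substring test, which is why it costs nothing to run. A tiny sketch of the idea (not Promptfoo's actual source):

```python
def icontains(output: str, value: str) -> bool:
    """Case-insensitive substring check, mirroring the idea behind
    Promptfoo's `icontains` assertion."""
    return value.lower() in output.lower()

# The model's raw output may include extra words; the check still passes:
print(icontains("Category: Technical", "Technical"))   # True
print(icontains("category: TECHNICAL issue", "technical"))  # True
print(icontains("Billing", "Technical"))               # False
```

Checks like this are exact and free, which is why they should carry most of your suite, with llm-rubric reserved for genuinely qualitative judgments.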
Advanced Workflow: Production Readiness
Moving beyond local tests requires integrating Promptfoo into your actual development lifecycle.
1. CI/CD Integration
Never merge a prompt change that breaks your “Golden Dataset.” By running Promptfoo in GitHub Actions, you can automatically block PRs if accuracy scores drop below your threshold.
# .github/workflows/prompt-test.yml
name: Prompt regression tests
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
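If a binary pass/fail exit code isn't enough and you want to gate on an accuracy threshold, one approach is to write results to JSON with `promptfoo eval -o results.json` and inspect them in a small script. This is a hypothetical gate script, and the JSON field names (`results.stats.successes`/`failures`) are assumptions; check the file your Promptfoo version actually writes and adjust:

```python
import json
import sys

THRESHOLD = 0.90  # minimum acceptable pass rate

def pass_rate(results: dict) -> float:
    # Field names are assumptions about Promptfoo's JSON output shape;
    # verify against a real results.json before relying on this.
    stats = results["results"]["stats"]
    total = stats["successes"] + stats["failures"]
    return stats["successes"] / total if total else 0.0

def gate(path: str = "results.json") -> None:
    with open(path) as f:
        rate = pass_rate(json.load(f))
    print(f"pass rate: {rate:.1%}")
    if rate < THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI step
```
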
2. Red Teaming and Security
Promptfoo includes built-in security auditing. It can generate hundreds of adversarial inputs to try and trick your prompt into leaking system instructions or generating toxic content. Run promptfoo redteam init to start scanning for vulnerabilities.
3. Model Comparison
Provider lock-in is a risk. Promptfoo makes it trivial to compare OpenAI against Anthropic’s Claude 3.5 Sonnet or local models like Llama 3 via Ollama. You can find the exact sweet spot between cost, speed, and intelligence.
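Comparing vendors is just a matter of listing more providers in the config. The model IDs below are examples and change over time, so check the current names before copying:

```yaml
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20240620  # needs ANTHROPIC_API_KEY
  - ollama:chat:llama3                             # needs a local Ollama server
```
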
Practical Tips for the Real World
- Version your prompts: Store your .txt prompt files in Git right next to your logic. This ensures your code and prompts stay in sync.
- Build a “Golden Dataset”: Start with 20 real-world examples of inputs and their ideal outputs. This is your baseline for all future improvements.
- Use Caching: LLM calls are expensive. Promptfoo caches results by default. If you don’t change a prompt or a test case, it won’t re-run the call, saving you both money and time.
- Minimize LLM-grading: While llm-rubric is powerful, it adds cost. Use deterministic checks like is-json or javascript assertions whenever possible.
- Monitor Latency: Add assertions for latency. A prompt that is 100% accurate but takes 15 seconds to respond is often useless in a real-time UI.
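A latency assertion can ride alongside the accuracy checks. A sketch (threshold is in milliseconds; note that cached responses report near-zero latency, so measure with caching disabled, e.g. promptfoo eval --no-cache):

```yaml
tests:
  - vars:
      ticket_body: "I can't log into my account."
    assert:
      - type: icontains
        value: Technical
      - type: latency
        threshold: 3000  # fail if the response takes longer than 3 seconds
```
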
Systematic testing separates “toy” AI projects from production-grade applications. By ditching manual audits for tools like Promptfoo, you gain the freedom to iterate without the fear of breaking your user experience. Give it a shot—your future self will thank you for the saved debugging hours.

