Context & Why LLM Evaluation Actually Matters
You ship an AI feature. Users start complaining that the bot gives wrong answers. You look at your logs — the model ran, returned a 200 OK, latency looks fine. But the answer was garbage. That’s the gap LLM evaluation exists to close.
Traditional software testing is binary: does the function return the right value? LLMs don’t work that way. A model can give a fluent, confident-sounding response that is factually wrong, off-topic, or toxic. Without structured evaluation, you’re flying blind in production.
I learned this the hard way on a customer support bot. Response times were excellent, uptime was perfect, and I thought the system was solid — until a manager forwarded me a screenshot of the bot telling a customer that their warranty had expired when it hadn’t. The model hallucinated. Nothing in my monitoring pipeline caught it.
After rebuilding the evaluation layer from scratch, I’ve run this setup in production for about six months. Failing responses dropped from roughly 8% to under 2% of sampled traffic. These are the pieces that actually moved the needle.
What Are You Actually Measuring?
LLM evaluation splits into two broad categories:
- Reference-based: You have a known correct answer and compare the model output against it (exact match, BLEU, ROUGE, semantic similarity).
- Reference-free: No ground truth — you evaluate properties like coherence, relevance, faithfulness, or safety using another model as a judge (LLM-as-judge).
For most production apps, you’ll need both. Reference-based catches factual regressions; reference-free catches tone, format, and reasoning quality.
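As a concrete reference-based example, here is a minimal token-overlap F1 check in plain Python, a rough stand-in for BLEU/ROUGE-style scoring. The `token_f1` name and the whitespace tokenization are my own simplifications, not part of any library:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared between prediction and reference (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1(
    "Product X has a 2-year warranty",
    "Product X has a 2-year warranty from purchase",
))
```

Real metrics handle stemming, n-grams, and synonyms, but even this crude overlap score is enough to flag a factually drifting answer against a golden reference.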
Installation: Setting Up Your Evaluation Stack
The two libraries I reach for most often are DeepEval (general-purpose LLM testing) and RAGAS (specialized for RAG pipelines). Install both into your project’s virtual environment:
```bash
pip install deepeval ragas openai anthropic
```
DeepEval integrates with pytest, so evaluations run as part of your CI pipeline — which is exactly where they belong. RAGAS is framework-agnostic and works with LangChain, LlamaIndex, or plain Python.
For RAG Applications Specifically
If your app does retrieval-augmented generation, RAGAS gives you four core metrics out of the box:
- Faithfulness: Does the answer stay faithful to the retrieved context?
- Answer Relevancy: Is the answer actually relevant to the question?
- Context Precision: Were the right chunks retrieved?
- Context Recall: Were all necessary chunks included?
```bash
pip install ragas datasets
```
Configuration: Defining Metrics and Test Cases
Building a DeepEval Test Suite
The most important step is building a golden dataset — a set of inputs with expected outputs or evaluation criteria. Don’t skip this. Even 50 well-chosen test cases catch more regressions than 1,000 random ones.
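One lightweight way to store a golden dataset is a JSONL file, one test case per line, versioned next to your code. This loader is a sketch under my own conventions; the file layout and the `load_golden_dataset` name are not a DeepEval requirement:

```python
import json
import tempfile
from pathlib import Path

def load_golden_dataset(path: str) -> list[dict]:
    """Load golden test cases from a JSONL file, one JSON object per line."""
    cases = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        # Fail fast on malformed rows so bad data never slips into CI silently
        assert "input" in case and "expected_output" in case, f"bad row: {case}"
        cases.append(case)
    return cases

# Build a tiny example file in a temp dir, then load it back
tmp = Path(tempfile.mkdtemp()) / "golden.jsonl"
tmp.write_text(
    '{"input": "What is the warranty period?", "expected_output": "2 years"}\n'
)
print(load_golden_dataset(str(tmp)))
```

Each dict can then be mapped to an `LLMTestCase` before evaluation, which keeps the dataset readable by non-engineers who review the expected answers.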
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# Define your test case
test_case = LLMTestCase(
    input="What is the warranty period for Product X?",
    actual_output="Product X comes with a 2-year warranty.",
    expected_output="Product X has a 2-year warranty from the date of purchase.",
    retrieval_context=[
        "Product X warranty: 2 years from date of purchase, covering manufacturing defects."
    ],
)

# Configure metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)
hallucination = HallucinationMetric(threshold=0.2)  # lower is better

# Run evaluation
evaluate([test_case], [answer_relevancy, faithfulness, hallucination])
```
Threshold selection matters more than most tutorials admit. Too loose and you catch nothing. Too tight and every deploy triggers a false alarm — your team starts ignoring CI failures, which defeats the whole point. I start at 0.7 across the board, then tighten individual metrics as I learn which score ranges actually predict user complaints.
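The gating logic itself is simple enough to sketch in plain Python: keep per-metric floors in one place so you can tighten them independently as you learn which scores predict complaints. The names below are illustrative, not DeepEval's API:

```python
# Per-metric minimum scores; hallucination is inverted (lower is better)
THRESHOLDS = {"answer_relevancy": 0.7, "faithfulness": 0.8}
MAX_HALLUCINATION = 0.2

def failing_metrics(scores: dict) -> list[str]:
    """Return the names of metrics that fall outside their thresholds."""
    failures = [
        name for name, floor in THRESHOLDS.items()
        if scores.get(name, 0.0) < floor
    ]
    if scores.get("hallucination", 0.0) > MAX_HALLUCINATION:
        failures.append("hallucination")
    return failures

# faithfulness sits below its 0.8 floor, so only it gets flagged
print(failing_metrics(
    {"answer_relevancy": 0.9, "faithfulness": 0.65, "hallucination": 0.1}
))
```

Centralizing thresholds like this also makes the tightening history visible in version control, which is useful when a CI failure sends you hunting for what changed.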
Running RAGAS on a RAG Pipeline
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Your pipeline outputs — collect these from actual runs
data = {
    "question": ["How do I reset my password?"],
    "answer": ["Go to Settings > Security > Reset Password and follow the steps."],
    "contexts": [[
        "To reset your password: navigate to Settings, click Security, then Reset Password."
    ]],
    "ground_truth": ["Users can reset their password via Settings > Security > Reset Password."],
}

dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Output: {'faithfulness': 0.97, 'answer_relevancy': 0.89, ...}
```
LLM-as-Judge for Open-Ended Quality
Some things resist numeric metrics — tone, professionalism, cultural appropriateness. For these, I use a separate model as a judge. The key is writing a strict, unambiguous rubric:
```python
import json

import anthropic

client = anthropic.Anthropic()

def judge_response(question: str, answer: str) -> dict:
    prompt = f"""You are a strict quality evaluator. Score this AI response on a scale of 1-5.

Criteria:
- Accuracy (1-5): Is the information correct?
- Tone (1-5): Is it professional and helpful?
- Completeness (1-5): Does it fully answer the question?

Question: {question}
Answer: {answer}

Return ONLY a JSON object: {{"accuracy": X, "tone": X, "completeness": X, "reasoning": "brief explanation"}}"""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

# Usage
scores = judge_response(
    "What's the refund policy?",
    "We offer a 30-day money-back guarantee on all products."
)
print(scores)
```
One caution: LLM judges are not perfect. They tend to favor longer, more confident-sounding answers even when a shorter answer is more accurate. Spot-check judge outputs against human ratings at least once a month.
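A spot-check can be as simple as correlating judge scores with human ratings over the same sample of responses. This sketch uses Pearson correlation in plain Python; the `pearson` helper and the 0.7 agreement cutoff are my own choices, not a standard:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between judge scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Judge vs. human scores for the same ten responses (1-5 scale)
judge = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
human = [5, 4, 3, 3, 4, 2, 4, 2, 5, 4]
r = pearson(judge, human)
print(f"judge-human agreement: r = {r:.2f}")
if r < 0.7:
    print("Agreement is low; re-check the rubric before trusting the judge.")
```

If agreement drops over time, the judge's rubric has probably drifted relative to what your reviewers actually care about, and the rubric, not the threshold, is what needs fixing.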
Verification & Monitoring: Keeping Quality Stable Over Time
Integrating Evaluation Into CI/CD
DeepEval ships with a pytest plugin. Put this in your test suite and it blocks deployments that regress below your thresholds:
```python
# tests/test_llm_quality.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# load_golden_dataset() is your own helper that returns LLMTestCase objects
@pytest.mark.parametrize("test_case", load_golden_dataset())
def test_answer_quality(test_case):
    metric = AnswerRelevancyMetric(threshold=0.75)
    assert_test(test_case, [metric])
```

Run it with DeepEval's CLI:

```bash
deepeval test run tests/test_llm_quality.py
```
I run this on every PR that touches prompt templates, retrieval logic, or the model version. It’s caught at least a dozen subtle regressions before they reached users.
Production Monitoring with Sampling
Running full evaluations on every production request is expensive. Sample a percentage of real traffic instead and evaluate asynchronously:
```python
import random

async def maybe_evaluate(question: str, answer: str, context: list[str]):
    # Evaluate 5% of production traffic
    if random.random() > 0.05:
        return
    # run_ragas_async, log_metric, and send_alert are your own helpers
    scores = await run_ragas_async(question, answer, context)

    # Log to your monitoring system
    log_metric("llm.faithfulness", scores["faithfulness"])
    log_metric("llm.answer_relevancy", scores["answer_relevancy"])

    # Alert if scores drop below threshold
    if scores["faithfulness"] < 0.6:
        send_alert(f"Faithfulness dropped to {scores['faithfulness']:.2f}")
```
Metrics to Track Over Time
Build a simple dashboard with these signals — even a spreadsheet works to start:
- Faithfulness score (weekly average) — hallucination detector
- Answer relevancy score — measures topic drift
- User thumbs-up/thumbs-down ratio — ground truth from real users
- Escalation rate — how often the bot fails and hands off to a human
- Latency p95 — quality regressions often come with latency spikes when you switch models
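Computing the weekly faithfulness average from sampled scores is a small group-by. This sketch assumes you can pull `(ISO timestamp, score)` pairs out of whatever logging sink the sampler writes to; the `weekly_averages` name is mine:

```python
from collections import defaultdict
from datetime import datetime

def weekly_averages(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group (ISO timestamp, score) samples by ISO week and average each week."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ts, score in samples:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        buckets[f"{year}-W{week:02d}"].append(score)
    return {week: sum(s) / len(s) for week, s in sorted(buckets.items())}

samples = [
    ("2024-03-04T10:00:00", 0.95),
    ("2024-03-06T14:30:00", 0.85),
    ("2024-03-12T09:15:00", 0.70),
]
print(weekly_averages(samples))
```

Even this dict dumped into a spreadsheet once a week is enough to spot a downward trend before users do.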
Automated scores and user feedback are more reliable together than either is alone. I’ve seen RAGAS faithfulness sit above 0.9 while thumbs-down ratings climbed week over week — turns out the model was faithfully citing context that was six months out of date. Running both signals caught the problem within a week; neither would have flagged it on its own.
When Scores Degrade: A Debugging Checklist
- Check if the base model was updated by your provider — silent changes happen more than vendors admit.
- Review recent prompt template changes. Even adding a single sentence can shift behavior in ways that don’t show up until you look at the tail of the score distribution.
- Check your retrieval pipeline — lower context precision means the model is getting irrelevant chunks.
- Look at the actual failing test cases, not just aggregate scores. Patterns in failures reveal root causes faster than any summary metric.
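The advice about the tail of the score distribution is easy to operationalize: compare a low percentile, not just the mean, since a regression in the worst responses can hide behind a stable average. A quick sketch, using a nearest-rank percentile of my own making:

```python
def percentile(scores: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of scores (p in 0-100)."""
    ranked = sorted(scores)
    index = max(0, min(len(ranked) - 1, round(p / 100 * (len(ranked) - 1))))
    return ranked[index]

# Same mean, very different tails
before = [0.85, 0.86, 0.88, 0.90, 0.91]
after = [0.70, 0.88, 0.92, 0.93, 0.97]
for name, scores in [("before", before), ("after", after)]:
    mean = sum(scores) / len(scores)
    print(f"{name}: mean={mean:.2f} p10={percentile(scores, 10):.2f}")
```

Both runs average 0.88, but the p10 of the second run has fallen from 0.85 to 0.70, exactly the kind of shift an aggregate score hides.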
Evaluation isn’t a one-time setup. It’s a feedback loop you maintain as your app grows and your models change. Start small — 20 well-labeled examples with two or three metrics beats an elaborate framework you never actually maintain. Add complexity only after you’ve identified which signals reliably predict real user satisfaction in your context.

