The Problem with ‘Vibe-Based’ Evaluation
We’ve all been there. You build a Retrieval-Augmented Generation (RAG) system, ask it five questions, and because the answers look reasonable, you assume it’s production-ready. In the industry, we call this ‘vibe-based’ evaluation. It works for a weekend hobby project, but it is a recipe for disaster in a professional environment.
I learned this the hard way when a system that looked perfect in testing started hallucinating 40% discounts that didn’t exist once it hit real users. You cannot improve what you cannot measure. If you swap your embedding model or adjust your chunk size from 500 to 1,000 tokens, you need to know—with data—if the system actually improved. RAGAS (RAG Assessment) solves this by using an ‘LLM-as-a-judge’ approach to provide objective scores.
The RAG Triad: Metrics That Matter
RAGAS doesn’t just look at the final text. It analyzes the relationship between the user’s question, the retrieved documents, and the generated response. This creates a clear picture of where your pipeline is breaking down; a rough sketch of the scoring intuition follows the list below.
- Faithfulness: This measures how loyal the answer is to the retrieved context. If your document says “The limit is $500” but the LLM says “The limit is $5,000,” your faithfulness score will crash toward zero. It is your best tool for catching hallucinations.
- Answer Relevancy: This checks if the response actually solves the user’s problem. It ignores fluff and penalizes answers that are technically correct but miss the point of the query.
- Context Recall: This evaluates your retriever. It checks if the search engine actually found the specific information needed to answer the question, comparing the results against a human-verified ‘Ground Truth.’
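Under the hood, the judge LLM does the heavy lifting (extracting claims from the answer and checking them against the context), but the final arithmetic reduces to simple ratios. Here is a rough sketch of that intuition, not RAGAS’s actual implementation:
# Illustrative only: in RAGAS the judge LLM extracts and verifies the claims;
# the scores then boil down to ratios along these lines.
def faithfulness_ratio(claims_supported_by_context: int, claims_in_answer: int) -> float:
    # Share of the answer's claims that the retrieved context actually backs up
    return claims_supported_by_context / claims_in_answer if claims_in_answer else 0.0

def context_recall_ratio(ground_truth_statements_found: int, ground_truth_statements: int) -> float:
    # Share of the ground-truth answer that can be attributed to the retrieved chunks
    return ground_truth_statements_found / ground_truth_statements if ground_truth_statements else 0.0

print(faithfulness_ratio(19, 20))    # 0.95 -> the answer sticks to the documents
print(context_recall_ratio(2, 4))    # 0.50 -> the retriever missed half the needed facts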
Setting Up Your Environment
You will need a Python environment and an API key for an LLM provider. Since RAGAS uses an LLM to grade your system, running an evaluation on 100 queries typically costs between $0.50 and $2.00 depending on the model you choose. It is a small price to pay for automated QA.
Install the required packages to get started:
pip install ragas datasets openai langchain langchain-openai
Configure your environment variables. Using a .env file is best practice, but for a quick test, you can set them in your script:
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
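If you go the .env route instead, a minimal sketch (assuming you have installed python-dotenv and placed OPENAI_API_KEY in a local .env file) looks like this:
# Load OPENAI_API_KEY from a local .env file instead of hard-coding it
from dotenv import load_dotenv
load_dotenv()  # reads the .env file in the current directory into os.environ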
Building an Evaluation Dataset
RAGAS doesn’t sit inside your live application code. Instead, you feed it a dataset of your system’s recent outputs. To get a full report, you need to organize your data into four specific columns.
- Question: The original user input.
- Answer: The text generated by your RAG pipeline.
- Contexts: The specific text chunks your retriever pulled from the database.
- Ground Truth: The ideal, human-written answer used as a benchmark.
Here is how to structure this using the datasets library:
from datasets import Dataset
data_samples = {
    'question': ['What is the return policy for electronics?'],
    'answer': ['You can return electronics within 30 days.'],
    'contexts': [['Our policy allows 30 days for electronics returns with a receipt.']],
    'ground_truth': ['Electronics must be returned within 30 days and require a receipt.']
}
dataset = Dataset.from_dict(data_samples)
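A single row will not tell you much. In practice, you loop over your test questions and capture what your pipeline actually returned; the rag_pipeline function below is a hypothetical stand-in for your own retrieval-and-generation call, and the sample questions are placeholders:
# Hypothetical: rag_pipeline(question) returns (generated_answer, list_of_retrieved_chunks)
questions = ['What is the return policy for electronics?', 'Do you offer gift wrapping?']
ground_truths = ['Electronics must be returned within 30 days and require a receipt.',
                 'Gift wrapping is available for a small fee at checkout.']

data_samples = {'question': [], 'answer': [], 'contexts': [], 'ground_truth': []}
for question, ground_truth in zip(questions, ground_truths):
    answer, chunks = rag_pipeline(question)        # your own pipeline call
    data_samples['question'].append(question)
    data_samples['answer'].append(answer)
    data_samples['contexts'].append(chunks)        # must be a list of strings per row
    data_samples['ground_truth'].append(ground_truth)

dataset = Dataset.from_dict(data_samples)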
Running the Evaluation
With your data ready, you can trigger the evaluation. I typically run these tests in batches whenever I change a retrieval parameter, such as switching from a basic vector search to a Hybrid Search with BM25.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
# Run the scoring process
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
df = results.to_pandas()
print(df)
The output provides a score between 0 and 1 for each category. A Faithfulness score of 0.95 is excellent. If you see a score of 0.40, your LLM is likely ignoring the provided documents and relying on its own internal training data instead.
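To act on those numbers, it helps to pull out the individual rows that drag the average down. Assuming the metric column in the dataframe is named faithfulness (matching the metric above), something like this surfaces the suspects for manual review:
# Flag rows whose faithfulness falls below an arbitrary review threshold
LOW_FAITHFULNESS = 0.75  # starting point; tune this for your domain
suspects = df[df['faithfulness'] < LOW_FAITHFULNESS]
print(suspects)  # inspect the questions and contexts behind the low scores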
Turning Scores into Action
The real value of RAGAS is diagnostic. If your Context Recall is low but Faithfulness is high, your LLM is being honest, but your search engine is failing. You might need to improve your chunking strategy or upgrade your embedding model from text-embedding-3-small to 3-large.
Conversely, if Answer Relevancy is low, your retriever found the right data, but the LLM got lost in the weeds. In this case, I usually simplify the system prompt or use a more powerful model like GPT-4o for the final generation step.
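If you prefer to encode that triage logic rather than eyeballing the dataframe, a rough sketch might look like the following; the 0.7 thresholds are arbitrary and should be replaced with your own baselines:
# Average each metric across the evaluation set and route to the likely culprit
scores = df[['faithfulness', 'answer_relevancy', 'context_recall']].mean()

if scores['context_recall'] < 0.7:
    print('Retrieval problem: revisit chunking or upgrade the embedding model.')
elif scores['answer_relevancy'] < 0.7:
    print('Generation problem: simplify the system prompt or use a stronger model.')
elif scores['faithfulness'] < 0.7:
    print('Hallucination problem: tighten the prompt to stay within the retrieved context.')
else:
    print('Pipeline looks healthy on this dataset.')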
Scaling with Local Models
If you are worried about costs while evaluating thousands of rows, you can swap the judge to a local model served by Ollama. Because Ollama exposes an OpenAI-compatible endpoint, you can point the same LangChain client at it and run large benchmarks without sending data to an external API. The example below assumes Ollama is running locally and that a model such as llama3 has already been pulled:
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall
# Point the judge at Ollama's OpenAI-compatible endpoint; "llama3" is a placeholder for whatever model you have pulled
local_llm = ChatOpenAI(model="llama3", base_url="http://localhost:11434/v1", api_key="ollama")
results = evaluate(dataset, metrics=[faithfulness, context_recall], llm=local_llm)
Note that Answer Relevancy also relies on an embedding model, so for a fully local run you would swap that out as well via the embeddings argument of evaluate.
Standardizing these metrics changed how I develop AI. Instead of guessing, I can now prove that a specific prompt tweak increased accuracy by 12%. Start by creating a ‘Golden Dataset’ of 20 high-quality questions. Run RAGAS every time you touch the code, and you will catch regressions before your users do.
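One way to make that routine stick is a small regression guard over your Golden Dataset: run the evaluation, compare the averages against the baselines you have already achieved, and fail the build if anything slips. The baseline numbers below are placeholders for your own:
# Minimal regression guard: fail loudly if any metric drops below its established baseline
BASELINES = {'faithfulness': 0.90, 'answer_relevancy': 0.85, 'context_recall': 0.80}

for metric, baseline in BASELINES.items():
    score = df[metric].mean()
    assert score >= baseline, f'Regression: {metric} dropped to {score:.2f} (baseline {baseline})'
print('No regressions detected.')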

