Opentrace evaluates its retrieval and generation quality using the RAGAs (Retrieval-Augmented Generation Assessment) framework — an open-source library purpose-built for measuring RAG pipeline performance without requiring hand-labelled ground-truth answers.
RAGAs uses an LLM-as-judge approach to cross-reference three pieces of information for every test case:

- the question asked,
- the context chunks retrieved for it, and
- the answer the pipeline generated.
By reasoning over these three inputs, RAGAs can detect hallucinations, measure answer completeness, and score retrieval precision — all without a manually curated test set. In this setup, GPT-4o (temperature 0) acts as the judge LLM and text-embedding-3-large is used for embedding-based relevancy scoring.
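For orientation, a minimal sketch of how this judge configuration can be expressed with RAGAs' LangChain wrappers is shown below (import paths assume recent ragas and langchain-openai releases; the evaluation script's actual wiring may differ):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper

# Deterministic judge: GPT-4o at temperature 0
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

# Embedding model used for the relevancy metric's similarity scoring
judge_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-large")
)
```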
| Metric | What it measures | Range |
|---|---|---|
| Faithfulness | Whether every factual claim in the answer can be traced back to the retrieved context. A score of 1.0 means zero hallucination — every statement is grounded in the provided documents. | 0 – 1 |
| Answer Relevancy | Whether the answer directly and completely addresses the question. Low scores indicate vague, off-topic, or excessively terse responses. Measured via cosine similarity between the question and synthetic back-generated questions from the answer. | 0 – 1 |
| Context Precision (disabled) | Fraction of retrieved chunks that are genuinely relevant. Requires ground-truth context — commented out in the current evaluation script. | 0 – 1 |
| Context Recall (disabled) | Whether all necessary information was retrieved from the knowledge base. Also requires ground-truth answers — disabled for the same reason. | 0 – 1 |
Faithfulness vs. Answer Relevancy: Faithfulness catches hallucination (did the LLM make things up?). Answer Relevancy catches completeness and focus (did the LLM actually answer the question?). Both can be high or low independently.
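To make the relevancy mechanics concrete, the snippet below is an illustrative re-implementation of that cosine-similarity step, not RAGAs' internal code; it assumes the judge LLM has already back-generated a handful of questions from the answer:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

def answer_relevancy_score(question: str, back_generated: list[str]) -> float:
    """Mean cosine similarity between the original question and questions
    regenerated from the answer; higher means the answer stayed on-point."""
    emb = OpenAIEmbeddings(model="text-embedding-3-large")
    q = np.array(emb.embed_query(question))
    gen = np.array(emb.embed_documents(back_generated))
    sims = gen @ q / (np.linalg.norm(gen, axis=1) * np.linalg.norm(q))
    return float(sims.mean())
```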
Evaluation runs in two sequential steps:
evaluation/scripts/collect_data.py fires each test question through the live RAG pipeline — the same retrieve_context and prepare_prompt_and_invoke_llm functions used in production. It collects the question, the retrieved context chunks, and the generated answer, then writes everything to a JSON dataset file.
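In outline, the collection loop looks roughly like the sketch below; the import path, the chunk attribute name, and the exact signatures of retrieve_context and prepare_prompt_and_invoke_llm are assumptions for illustration, not the verbatim script:

```python
import json

# Assumed import path; in the real script these come from the production RAG module.
from app.rag import retrieve_context, prepare_prompt_and_invoke_llm

PROJECT_ID = "123e4567-e89b-12d3-a456-426614174000"  # placeholder Supabase project UUID
TEST_QUESTIONS = [
    "What is the Big Bang theory?",
    "How many neurons does the human brain contain?",
]

records = []
for question in TEST_QUESTIONS:
    chunks = retrieve_context(question, project_id=PROJECT_ID)  # same retrieval as production
    answer = prepare_prompt_and_invoke_llm(question, chunks)    # same generation as production
    records.append({
        "question": question,
        "contexts": [chunk.content for chunk in chunks],  # attribute name is an assumption
        "answer": answer,
    })

# Path relative to server/; matches the directory layout shown further down.
with open("evaluation/scripts/datasets/ragas_evaluation_dataset-1.json", "w") as f:
    json.dump(records, f, indent=2)
```

From the server directory, the collection step is run with: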
```bash
cd server
poetry run python evaluation/scripts/collect_data.py
```

evaluation/scripts/run_evaluation.py loads the JSON dataset, constructs a HuggingFace Dataset, and calls ragas.evaluate with the configured metrics. Per-question scores are written to datasets/results.csv.
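A condensed sketch of that scoring step, reusing the judge wrappers from the configuration sketch above (column names follow the question / contexts / answer schema the collector writes; exact paths may differ):

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# context_precision / context_recall are omitted: they require ground-truth labels.

with open("evaluation/scripts/datasets/ragas_evaluation_dataset-1.json") as f:
    records = json.load(f)

dataset = Dataset.from_list(records)  # columns: question, contexts, answer

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,                # GPT-4o wrapper from the earlier sketch
    embeddings=judge_embeddings,  # text-embedding-3-large wrapper
)

result.to_pandas().to_csv("evaluation/scripts/datasets/results.csv", index=False)
```

The corresponding command, again from the server directory: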
```bash
poetry run python evaluation/scripts/run_evaluation.py
```

```
evaluation/
├── __init__.py
└── scripts/
    ├── collect_data.py                      # Step 1 — data collection
    ├── run_evaluation.py                    # Step 2 — RAGAs scoring
    └── datasets/
        ├── ragas_evaluation_dataset-1.json  # Q / context / answer triples
        └── results.csv                      # Per-question metric scores
```

The initial evaluation covered 2 questions drawn from a mixed-topic knowledge base containing cosmology and neuroscience documents, using the basic vector search retrieval strategy.
| Question | Faithfulness | Answer Relevancy |
|---|---|---|
| What is the Big Bang theory? | 1.0 | 0.765 |
| How many neurons does the human brain contain? | 1.0 | 1.0 |
| Average | 1.0 | 0.883 |
Faithfulness: 1.0 (perfect) — Both answers were entirely grounded in the retrieved context. The RAG pipeline did not introduce any hallucinated facts.
Answer Relevancy: 0.883 (strong) — The neuroscience answer scored a perfect 1.0 because the response was concise and directly on-point. The Big Bang answer scored 0.765 because the LLM summarised only part of the context rather than addressing the broader scope that the question implied — the answer was factually correct but slightly narrower than expected.
High faithfulness + lower relevancy is the typical failure mode for RAG systems: the pipeline retrieves good context and stays grounded in it, but the LLM answer becomes overly conservative or narrow. Tuning the system prompt or increasing final_context_size in RAG settings can raise relevancy.
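If these knobs live in a per-project settings object (the exact schema below is a guess for illustration only), the adjustment can be as small as:

```python
# Hypothetical RAG settings tweak: hand more chunks to the generator so answers
# can cover the broader scope a question implies.
rag_settings = {
    "retrieval_strategy": "basic_vector_search",  # strategy used in the initial run
    "final_context_size": 8,                      # raised from a smaller assumed default
}
```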
To evaluate a different project or a larger question set:
1. Open evaluation/scripts/collect_data.py and set PROJECT_ID to the Supabase project UUID you want to test.
2. Extend the TEST_QUESTIONS list (see the sketch below). A larger, more diverse set (20+ questions) gives statistically meaningful averages. The script ships with 25 commented-out example questions across cosmology, neuroscience, and history.
3. Run collect_data.py to generate a fresh dataset JSON.
4. Run run_evaluation.py — scores land in datasets/results.csv.

Both scripts require a valid OPENAI_API_KEY in your .env file. The evaluation step calls GPT-4o as the judge LLM and incurs API costs proportional to the number of questions and context length.
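As a concrete illustration of steps 1 and 2, with placeholder values rather than anything taken from the repository:

```python
# evaluation/scripts/collect_data.py
PROJECT_ID = "123e4567-e89b-12d3-a456-426614174000"  # your Supabase project UUID

TEST_QUESTIONS = [
    # Cosmology
    "What is the Big Bang theory?",
    "What is the cosmic microwave background?",
    # Neuroscience
    "How many neurons does the human brain contain?",
    "Which brain structure consolidates long-term memories?",
    # History
    "What triggered the outbreak of the First World War?",
    # Extend to 20+ questions for statistically meaningful averages.
]
```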