Opentrace evaluates its retrieval and generation quality using the RAGAs (Retrieval-Augmented Generation Assessment) framework — an open-source library purpose-built for measuring RAG pipeline performance without requiring hand-labelled ground-truth answers.
RAGAs uses an LLM-as-judge approach to cross-reference three pieces of information for every test case:

- the question asked,
- the context chunks retrieved for it, and
- the answer the pipeline generated.
By reasoning over these three inputs, RAGAs can detect hallucinations, measure answer completeness, and score retrieval precision — all without a manually curated test set. In this setup, GPT-4o (temperature 0) acts as the judge LLM and text-embedding-3-large is used for embedding-based relevancy scoring.
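For orientation, a minimal sketch of how this judge configuration can be expressed with RAGAs' LangChain wrappers is shown below (import paths assume recent ragas and langchain-openai releases; the evaluation script's actual wiring may differ):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper

# Deterministic judge: GPT-4o at temperature 0
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

# Embedding model used for the relevancy metric's similarity scoring
judge_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-large")
)
```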
| Metric | What it measures | Range |
|---|---|---|
| Faithfulness | Whether every factual claim in the answer can be traced back to the retrieved context. A score of 1.0 means zero hallucination — every statement is grounded in the provided documents. | 0 – 1 |
| Answer Relevancy | Whether the answer directly and completely addresses the question. Low scores indicate vague, off-topic, or excessively terse responses. Measured via cosine similarity between the question and synthetic back-generated questions from the answer. | 0 – 1 |
| Context Precision (disabled) | Fraction of retrieved chunks that are genuinely relevant. Requires ground-truth context — commented out in the current evaluation script. | 0 – 1 |
| Context Recall (disabled) | Whether all necessary information was retrieved from the knowledge base. Also requires ground-truth answers — disabled for the same reason. | 0 – 1 |
Faithfulness vs. Answer Relevancy: Faithfulness catches hallucination (did the LLM make things up?). Answer Relevancy catches completeness and focus (did the LLM actually answer the question?). Both can be high or low independently.
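To make the relevancy mechanics concrete, the snippet below is an illustrative re-implementation of that cosine-similarity step, not RAGAs' internal code; it assumes the judge LLM has already back-generated a handful of questions from the answer:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

def answer_relevancy_score(question: str, back_generated: list[str]) -> float:
    """Mean cosine similarity between the original question and questions
    regenerated from the answer; higher means the answer stayed on-point."""
    emb = OpenAIEmbeddings(model="text-embedding-3-large")
    q = np.array(emb.embed_query(question))
    gen = np.array(emb.embed_documents(back_generated))
    sims = gen @ q / (np.linalg.norm(gen, axis=1) * np.linalg.norm(q))
    return float(sims.mean())
```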
Evaluation runs in two sequential steps:
evaluation/scripts/collect_data.py fires each test question through the live RAG pipeline — the same retrieve_context and prepare_prompt_and_invoke_llm functions used in production. It collects the question, the retrieved context chunks, and the generated answer, then writes everything to a JSON dataset file.
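In outline, the collection loop looks roughly like the sketch below; the import path, the chunk attribute name, and the exact signatures of retrieve_context and prepare_prompt_and_invoke_llm are assumptions for illustration, not the verbatim script:

```python
import json

# Assumed import path; in the real script these come from the production RAG module.
from app.rag import retrieve_context, prepare_prompt_and_invoke_llm

PROJECT_ID = "123e4567-e89b-12d3-a456-426614174000"  # placeholder Supabase project UUID
TEST_QUESTIONS = [
    "What is the Big Bang theory?",
    "How many neurons does the human brain contain?",
]

records = []
for question in TEST_QUESTIONS:
    chunks = retrieve_context(question, project_id=PROJECT_ID)  # same retrieval as production
    answer = prepare_prompt_and_invoke_llm(question, chunks)    # same generation as production
    records.append({
        "question": question,
        "contexts": [chunk.content for chunk in chunks],  # attribute name is an assumption
        "answer": answer,
    })

# Path relative to server/; matches the directory layout shown further down.
with open("evaluation/scripts/datasets/ragas_evaluation_dataset-1.json", "w") as f:
    json.dump(records, f, indent=2)
```

From the server directory, the collection step is run with: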
```bash
cd server
poetry run python evaluation/scripts/collect_data.py
```

evaluation/scripts/run_evaluation.py loads the JSON dataset, constructs a HuggingFace Dataset, and calls ragas.evaluate with the configured metrics. Per-question scores are written to datasets/results.csv.
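A condensed sketch of that scoring step, reusing the judge wrappers from the configuration sketch above (column names follow the question / contexts / answer schema the collector writes; exact paths may differ):

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# context_precision / context_recall are omitted: they require ground-truth labels.

with open("evaluation/scripts/datasets/ragas_evaluation_dataset-1.json") as f:
    records = json.load(f)

dataset = Dataset.from_list(records)  # columns: question, contexts, answer

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,                # GPT-4o wrapper from the earlier sketch
    embeddings=judge_embeddings,  # text-embedding-3-large wrapper
)

result.to_pandas().to_csv("evaluation/scripts/datasets/results.csv", index=False)
```

The corresponding command, again from the server directory: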
```bash
poetry run python evaluation/scripts/run_evaluation.py
```

```
evaluation/
├── __init__.py
└── scripts/
    ├── collect_data.py                      # Step 1 — data collection
    ├── run_evaluation.py                    # Step 2 — RAGAs scoring
    └── datasets/
        ├── ragas_evaluation_dataset-1.json  # Q / context / answer triples
        └── results.csv                      # Per-question metric scores
```

The initial evaluation covered 2 questions drawn from a mixed-topic knowledge base containing cosmology and neuroscience documents, using the basic vector search retrieval strategy.
| Question | Faithfulness | Answer Relevancy |
|---|---|---|
| What is the Big Bang theory? | 1.0 | 0.765 |
| How many neurons does the human brain contain? | 1.0 | 1.0 |
| Average | 1.0 | 0.883 |
Faithfulness: 1.0 (perfect) — Both answers were entirely grounded in the retrieved context. The RAG pipeline did not introduce any hallucinated facts.
Answer Relevancy: 0.883 (strong) — The neuroscience answer scored a perfect 1.0 because the response was concise and directly on-point. The Big Bang answer scored 0.765 because the LLM summarised only part of the context rather than addressing the broader scope that the question implied — the answer was factually correct but slightly narrower than expected.
High faithfulness + lower relevancy is the typical failure mode for RAG systems: the pipeline retrieves good context and stays grounded in it, but the LLM answer becomes overly conservative or narrow. Tuning the system prompt or increasing final_context_size in RAG settings can raise relevancy.
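If these knobs live in a per-project settings object (the exact schema below is a guess for illustration only), the adjustment can be as small as:

```python
# Hypothetical RAG settings tweak: hand more chunks to the generator so answers
# can cover the broader scope a question implies.
rag_settings = {
    "retrieval_strategy": "basic_vector_search",  # strategy used in the initial run
    "final_context_size": 8,                      # raised from a smaller assumed default
}
```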
To evaluate a different project or a larger question set:
1. Open evaluation/scripts/collect_data.py and set PROJECT_ID to the Supabase project UUID you want to test.
2. Extend the TEST_QUESTIONS list (see the sketch below). A larger, more diverse set (20+ questions) gives statistically meaningful averages. The script ships with 25 commented-out example questions across cosmology, neuroscience, and history.
3. Run collect_data.py to generate a fresh dataset JSON.
4. Run run_evaluation.py — scores land in datasets/results.csv.

Both scripts require a valid OPENAI_API_KEY in your .env file. The evaluation step calls GPT-4o as the judge LLM and incurs API costs proportional to the number of questions and context length.
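As a concrete illustration of steps 1 and 2, with placeholder values rather than anything taken from the repository:

```python
# evaluation/scripts/collect_data.py
PROJECT_ID = "123e4567-e89b-12d3-a456-426614174000"  # your Supabase project UUID

TEST_QUESTIONS = [
    # Cosmology
    "What is the Big Bang theory?",
    "What is the cosmic microwave background?",
    # Neuroscience
    "How many neurons does the human brain contain?",
    "Which brain structure consolidates long-term memories?",
    # History
    "What triggered the outbreak of the First World War?",
    # Extend to 20+ questions for statistically meaningful averages.
]
```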