We’ve got a gem of a deep tech dive for you this time around from the tech team here at Griffin AI, as we learn about a subject close to the hearts of our team deep in the lab: the design and testing of Retrieval-Augmented Generation. This crucial topic is central to the creation and deployment of AI agents, since it allows them to draw on additional data and context before producing a response. Let’s hear how it’s done on the ground.
The most popular way to give LLMs added context
Retrieval-Augmented Generation (RAG) has become the most popular method for providing LLMs with additional context so they can produce tailored outputs. This approach is ideal for LLM applications such as chatbots or AI agents, as RAG offers users a significantly more contextual experience that goes beyond the data on which LLMs like GPT-4 were trained.
It's no surprise that LLM practitioners have faced challenges in evaluating RAG applications during development. However, thanks to research such as RAGAS, assessing the general performance of retriever-generator RAG systems is, as of 2024, a largely solved problem. Nonetheless, building RAG applications remains a challenge: you might use the wrong embedding model, a poor chunking strategy, or output answers in an incorrect format, problems that frameworks like LlamaIndex aim to address.
Now, as RAG architectures become increasingly complex and collaboration among LLM experts on these projects intensifies, breaking changes occur more frequently than ever.
In this guide, we’ll explore how to set up a fully automated evaluation/testing suite for modular testing of RAG applications in your CI/CD pipelines. Ready to learn how to configure the perfect RAG development workflow?
A brief overview of RAG evaluation
A typical RAG architecture consists of two key components:
Retriever – a module that performs vector search in your knowledge base to extract relevant contexts.
Generator – a module that takes the retrieved contexts from the retriever, constructs a prompt, and generates a customized LLM response as the final output.
A high-performance RAG system stems from high-performance retrievers and generators. Therefore, RAG evaluation metrics typically focus on assessing these two components. The core assumption is that a RAG application can only succeed if the retriever effectively extracts the correct and relevant contexts, and the generator can efficiently leverage these contexts to produce the desired outputs (i.e., factually correct and relevant results).
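To make the two roles concrete, here is a minimal, framework-agnostic sketch of a RAG pipeline in Python. The retriever and generator below are stand-ins (a keyword-overlap search and a stubbed LLM call) so the example runs on its own; in a real application you would swap in your vector store and LLM client:

```python
# A minimal sketch of a RAG pipeline: a retriever and a generator.
# The knowledge base, search, and LLM call are stubs for illustration only.

KNOWLEDGE_BASE = [
    "RAG retrieves relevant context and feeds it to the generator.",
    "The generator builds a prompt from the retrieved context and calls an LLM.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Retriever: in a real system this would be a vector search over your knowledge base.
    query_terms = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(query: str, contexts: list[str]) -> str:
    # Generator: build a prompt from the retrieved contexts, then call an LLM.
    context_block = "\n".join(contexts)
    prompt = f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"
    # Replace this stub with a real LLM call (OpenAI, LlamaIndex, etc.).
    return f"[stub answer generated from a {len(prompt)}-character prompt]"

def rag_pipeline(query: str) -> tuple[str, list[str]]:
    contexts = retrieve(query)
    return generate(query, contexts), contexts
```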
Common RAG metrics
For the reasons mentioned above, RAG evaluation metrics tend to center around the retriever and the generator. RAGAS, in particular, is a popular approach for evaluating overall RAG performance and defines metrics for both components. Let’s explore some common RAG evaluation metrics:
Contextual completeness
Contextual completeness measures how well the retrieved context encapsulates the information needed to generate the expected output. This metric pertains to the retriever and requires the expected output as the target label. This may confuse some, but the expected output is needed precisely because it makes no sense to use the actual output as the ground truth. Consider this: how can you judge the quality of the retrieved context if you don’t know what the ideal output should look like?
Contextual accuracy
Contextual accuracy evaluates how well your RAG retriever ranks the retrieved context based on relevance. This is crucial because LLMs tend to prioritize nodes closer to the end of the prompt template. Poor re-ranking can cause the LLM to focus on the “wrong” retrieval nodes, potentially leading to hallucinations or irrelevant responses.
Response relevance
Response relevance measures how well your RAG generator, often just an LLM together with its prompt, produces relevant answers. Note that response relevance is directly tied to the retriever’s quality, as the RAG pipeline relies on the retrieved context to generate outputs. If the retrieved context is misleading or irrelevant, the answers will almost inevitably be less relevant as well.
Factuality
Factuality assesses the degree of hallucinations produced by your RAG generator, using the retrieved context as the ground truth. Similar to response relevance, the degree of factuality depends on the relevance of the retrieved context.
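To make this concrete, each of these metrics can also be computed on its own in DeepEval. A minimal sketch, scoring factuality (DeepEval’s FaithfulnessMetric) for a single hand-written test case, might look like this; the query, output, and retrieval context are invented for illustration:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# An invented example: the retrieved context serves as the ground truth.
test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="The company was founded in 2015 in London.",
    retrieval_context=["The company was founded in 2015 and is headquartered in London."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)          # uses an LLM (GPT models by default) as the judge
print(metric.score, metric.reason)
```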
RAG metrics are not perfect
Beyond their effectiveness, the strongest aspect of these metrics is their use-case-agnostic nature. Whether you are building a chatbot for financial advisors or a data extraction application, these metrics will work as expected. Ironically, despite this use-case independence, you will soon find that these metrics are ultimately too general. A financial advisory chatbot might require additional metrics, such as bias in handling client data, while a data extraction application might need metrics to ensure the output adheres to a JSON format.
In addition to standard RAG evaluation metrics, you can integrate advanced, use-case-specific LLM evaluation metrics into your RAG evaluation pipeline using the DeepEval Python library.
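For instance, DeepEval’s G-Eval metric lets you describe custom criteria in plain language. The sketch below defines a hypothetical JSON-format-adherence metric of the kind a data extraction application might need; the name, criteria, and threshold are illustrative assumptions, not a prescribed recipe:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative custom metric for a data extraction use case.
json_format_metric = GEval(
    name="JSON Format Adherence",
    criteria=(
        "Determine whether the actual output is valid JSON and follows "
        "the schema requested in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)
```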
Unit testing RAG applications with DeepEval
DeepEval is an open-source evaluation framework for LLMs, often referred to as unit testing for LLMs or a comprehensive testing suite for LLMs. In the previous section, we explored how to use RAG evaluation metrics and additional case-specific metrics to assess RAG applications. Here, we’ll walk through a full example of unit testing with DeepEval.
Prerequisites
Unlike other workflows, the primary goal of evaluation in CI/CD pipelines is to safeguard against breaking changes introduced to your RAG application in a given Git commit. Therefore, a static evaluation dataset that does not reflect the changes made to your RAG application won’t suffice.
You don’t need to pre-prepare an evaluation dataset containing the actual outputs and retrieval contexts of your RAG application. Instead, prepare a set of inputs and expected outputs against which you want to test the corresponding actual outputs of your RAG application. The RAG application itself is executed during the evaluation process to produce those actual outputs and retrieval contexts.
The testing process consists of the steps below:
Create a test file
To start, create a test file that will serve as the foundation for running automated evaluations of your RAG application. This file will define the structure of the tests and allow for seamless integration into CI/CD pipelines. It should include the necessary imports, specify test parameters, and be structured to validate retrieval accuracy, response relevance, and factuality. Each test will compare expected outputs to actual results, ensuring that modifications to your RAG system do not introduce unintended regressions.
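A minimal sketch of such a file is shown below; the file name test_rag.py and the my_rag_app.rag_pipeline import (standing in for your own RAG application) are assumptions for illustration, and the metrics and test data are filled in over the next steps:

```python
# test_rag.py -- picked up by pytest and by `deepeval test run`
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualRecallMetric,
    ContextualPrecisionMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

# Hypothetical import: replace with your own RAG application.
from my_rag_app import rag_pipeline


def test_rag_pipeline():
    ...  # metrics, test data, and assertions are added in the next steps
```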
Initialize evaluation metrics
Initialize the evaluation metrics in the newly created test file. Each metric in DeepEval produces a score ranging from 0 to 1; a metric passes only if its score meets or exceeds the threshold value, and a test case is considered successful only if all of its metrics pass.
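As a sketch, the four metrics described earlier correspond roughly to DeepEval’s built-in RAG metrics; the thresholds and the gpt-4 evaluation model below are illustrative choices rather than requirements:

```python
# Thresholds and the evaluation model are illustrative; tune them per metric.
contextual_recall = ContextualRecallMetric(threshold=0.7, model="gpt-4")        # contextual completeness
contextual_precision = ContextualPrecisionMetric(threshold=0.7, model="gpt-4")  # contextual accuracy
answer_relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4")          # response relevance
faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4")                 # factuality
```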
Define input and expected output data
Here, you specify the set of inputs on which you want to run your RAG application during evaluation, along with the expected output for each one.
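Continuing the sketch, the placeholder test function from the skeleton is then filled in: the hypothetical rag_pipeline is executed at evaluation time, and every metric is asserted against the resulting test case. The query and expected output below are invented for illustration:

```python
def test_rag_pipeline():
    # Illustrative input/expected output pair.
    input_query = "What does our refund policy say about digital goods?"
    expected_output = "Digital goods can be refunded within 14 days of purchase."

    # Execute the RAG application at evaluation time to obtain the
    # actual output and the retrieval context.
    actual_output, retrieval_context = rag_pipeline(input_query)

    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context,
    )
    assert_test(
        test_case,
        [contextual_recall, contextual_precision, answer_relevancy, faithfulness],
    )
```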
Finally, run the test file via the CLI.
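Assuming the file is named test_rag.py as in the sketch above, the command looks like this:

```bash
deepeval test run test_rag.py
```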
A few things to note
Most metrics are evaluated using OpenAI GPT models by default, so make sure to set your OpenAI API key as an environment variable.
You can define passing thresholds and specify the evaluation model you want to use for each metric.
A test case is considered successful only when all of its evaluation metrics pass.
You can include as many metrics as you want. A complete list of metrics can be found here.
For cleaner code, you can import pairs of input and expected output data from CSV/JSON files (see the sketch after these notes).
The actual outputs and retrieval contexts are generated dynamically. In the example above, we used a hypothetical RAG implementation, but you will need to replace it with your own RAG application. (LlamaIndex users can find a great example in the DeepEval documentation.)
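For example, one way to drive the test from a CSV of input/expected-output pairs (a sketch using Python’s standard csv module and pytest parametrization; the test_data.csv file, its column names, and the my_rag_app import are assumptions) looks like this:

```python
import csv

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

from my_rag_app import rag_pipeline  # hypothetical: your own RAG application


def load_pairs(path: str = "test_data.csv"):
    # Expects columns named "input" and "expected_output".
    with open(path, newline="") as f:
        return [(row["input"], row["expected_output"]) for row in csv.DictReader(f)]


@pytest.mark.parametrize("input_query,expected_output", load_pairs())
def test_rag_from_csv(input_query: str, expected_output: str):
    actual_output, retrieval_context = rag_pipeline(input_query)
    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context,
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```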
Unit testing RAG in CI/CD pipelines
Good news: you've already done 99% of the required heavy lifting. All that's left is to set the correct environment variables (such as your OpenAI API key) and include the deepeval test command in your CI/CD environment.
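As an illustrative sketch only (assuming GitHub Actions, a requirements.txt, and the test_rag.py file from earlier; adapt it to whichever CI system you use), the workflow might look like this:

```yaml
# .github/workflows/rag-evals.yml -- illustrative sketch, not a prescribed setup
name: RAG evaluations
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt deepeval
      - name: Run RAG evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run test_rag.py
```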
When all is said and done
Existing RAG evaluation metrics like those from RAGAS are great for assessing the performance of a general-purpose retriever-generator system but often fall short for use-case-specific applications. Moreover, evaluations are not just a health check but a safeguard against breaking changes, especially in a collaborative development environment. Therefore, incorporating evaluations into CI/CD pipelines is crucial for any serious organization developing RAG applications.
If you want to implement your own evaluation metrics to address the shortcomings of universal RAG metrics and are looking for a production-grade testing framework to include in CI/CD pipelines, DeepEval is an excellent option. It includes over 14 evaluation metrics, supports parallel test execution, and is deeply integrated with Confident AI, the world's first open-source evaluation infrastructure for LLMs. Explore the Griffin AI community, visit our agents in the AI Playground, and engage with our journey.