Four techniques for testing generative AI applications

John Gluck
August 26, 2024

A new frontier of generative AI-based apps is opening, and the pioneering engineers building them will need new ways of regression testing. As challenging as building the new products will be, automated testing may be even more complex. 

The complexities stem from two main characteristics of large language models (LLMs):

  1. They have the potential to be much more resource-intensive (read “expensive”) than any other kind of application, and
  2. Their output is probabilistic (i.e., somewhat random), which means that in order to go fast, you need to figure out how to automate a human’s ability to determine if an LLM app’s response falls within some acceptable range of correctness. 

At QA Wolf, we’ve been building automated black-box regression tests for generative AI-based applications for a while now. We’ve learned how to build deterministic tests for non-deterministic outputs, and we'd like to share what works and what doesn't.

Technique 1: Deterministic assertions to validate non-deterministic outputs

We’ve grown accustomed to the ease with which we can assert the state of the UI through the DOM. However, the industry doesn’t yet have the tools and libraries to help us with some of the assertion patterns we’re encountering in the new world of automated AI testing. Until it does, we’re just going to have to code them ourselves.

Compare snapshots with visual diffing

For LLMs that output images or canvas elements, visual diffing is effective for testing whether changes to the underlying model have affected what the LLM returns and whether the application correctly renders data returned by the LLM. 

The QA Wolf application takes a screenshot of the AI-generated object and compares it to a known “correct” baseline or “golden master.” The test fails if the difference between the two images exceeds a pre-defined threshold set by the automator. 
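Here’s a minimal sketch of what that can look like with Playwright Test’s built-in screenshot comparison. The route, button label, selector, and threshold are hypothetical placeholders, not QA Wolf’s actual implementation:

```ts
// visual-diff.spec.ts — sketch of golden-master visual diffing with Playwright.
// The first run records the baseline screenshot; later runs compare against it.
import { test, expect } from '@playwright/test';

test('AI-generated logo matches the golden master', async ({ page }) => {
  await page.goto('/logo-generator');
  await page.getByRole('button', { name: 'Generate logo' }).click();

  // Wait for the generated artifact to render before comparing.
  const artifact = page.locator('#generated-logo');
  await artifact.waitFor();

  // Fails if more than 2% of pixels differ from the stored baseline.
  await expect(artifact).toHaveScreenshot('logo-baseline.png', {
    maxDiffPixelRatio: 0.02,
  });
});
```

The threshold is the knob the automator tunes: too tight and harmless rendering noise fails the test, too loose and real regressions slip through.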

Match golden master results to a labeled dataset

This is another form of golden master testing in which you compare your current result to a previously executed result that is known to be good. Your team can export an AI-generated object as structured data (e.g., SVG, XML) to set the master and then compare subsequent results to that master. This approach can be helpful when dealing with more cumbersome methods of data extraction, such as when you need to target specific x,y coordinates in a canvas API. 

Depending on the artifact, you can infer the accuracy of an output with varying degrees of confidence from contextual clues (i.e., a heuristic function). For example, the number of bullet points in a list, or even simply the word count, can indicate the length of a response. You can also use the relative x and y positions of multiple canvas elements to infer the shapes of diagrams. 
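As a rough sketch of that kind of heuristic assertion, assume the application under test can export the generated diagram as SVG; the endpoint, element counts, and thresholds below are all made up for illustration:

```ts
// heuristics.spec.ts — sketch of heuristic assertions on structured output.
import { test, expect } from '@playwright/test';

test('generated diagram roughly matches the golden master', async ({ request }) => {
  // Hypothetical export endpoint that returns the diagram as SVG.
  const response = await request.get('/api/diagram/export?format=svg');
  const svg = await response.text();

  // Heuristic 1: the diagram should contain roughly as many shapes as the
  // golden master (say, 12 rectangles, give or take 2).
  const rectCount = (svg.match(/<rect\b/g) ?? []).length;
  expect(rectCount).toBeGreaterThanOrEqual(10);
  expect(rectCount).toBeLessThanOrEqual(14);

  // Heuristic 2: relative position — the title shape should sit near the top,
  // so the smallest y coordinate should be close to zero.
  const yValues = [...svg.matchAll(/y="(\d+(?:\.\d+)?)"/g)].map((m) => Number(m[1]));
  expect(Math.min(...yValues)).toBeLessThan(50);
});
```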

Use allow/deny lists

Allow/deny lists of words, phrases, and images can help your team test prompt filters, which prevent users from generating inappropriate or unsafe material.

To test the accuracy of the application’s deny list, you might raise the temperature hyperparameter to a high setting (the highest is 2.0, which is the most random) and then give the application the same prompt multiple times, looking for occurrences of any entry in the deny list each time and failing if a match is found. 
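A sketch of that loop might look like the following. It assumes a test-only way to raise the temperature (here, a query parameter) and a known output element; the prompt, selectors, and deny-list entries are hypothetical:

```ts
// deny-list.spec.ts — sketch of a repeated deny-list check at high temperature.
import { test, expect } from '@playwright/test';

const DENY_LIST = ['example banned phrase', 'another banned phrase'];

test('prompt filter blocks deny-listed content', async ({ page }) => {
  // Run the same prompt several times to widen the range of outputs
  // the filter has to handle.
  for (let i = 0; i < 5; i++) {
    await page.goto('/chat?temperature=2.0');
    await page.getByLabel('Prompt').fill('Write a slogan for our petting zoo');
    await page.getByRole('button', { name: 'Send' }).click();

    const reply = (await page.locator('.response').last().innerText()).toLowerCase();
    for (const phrase of DENY_LIST) {
      expect(reply).not.toContain(phrase);
    }
  }
});
```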

Technique 2: Using AI oracles to make deterministic assessments 

From the user’s perspective, the most important thing to test will be the accuracy of the output. In the days of yore, we used simplistic calculations to determine if an application returned an acceptable result, and we liked it that way. 

Testing the accuracy of an LLM's output becomes trickier with more complex and subjective prompts. For example, asking it to generate colorful, family-friendly logo ideas for a petting zoo's t-shirt introduces subjectivity. The term "family-friendly" covers many aspects, like colors and subject matter, and people often disagree on what it means, which makes it hard for computers to evaluate such subjective criteria reliably.

Using AI to test AI introduces some obvious (and interesting) problems, but these techniques help us support some of our more unique and complex clients. 

Multiple choice quiz

Here’s how it works: The test captures an output artifact (screenshot, text block, etc.) and then uploads the artifact to an LLM. The test asks the LLM to analyze the artifact and answer a multiple-choice quiz, specific to the expected output, that has a single correct answer. The test then asserts that the LLM chose the correct answer.
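A minimal sketch of a quiz oracle, using the OpenAI Node SDK as the grader. The model name, quiz wording, and expected answer are illustrative assumptions, not a prescribed setup:

```ts
// quiz-oracle.ts — sketch of a multiple-choice oracle over a screenshot.
import OpenAI from 'openai';
import { readFile } from 'node:fs/promises';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function quizOracle(screenshotPath: string): Promise<void> {
  const image = (await readFile(screenshotPath)).toString('base64');

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0, // keep the oracle itself as predictable as possible
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text:
              'What appears in this logo?\n' +
              'A) A goat  B) A tractor  C) A skyscraper  D) No animal at all\n' +
              'Answer with a single letter.',
          },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${image}` } },
        ],
      },
    ],
  });

  const answer = completion.choices[0].message.content?.trim();
  if (answer !== 'A') {
    throw new Error(`Oracle chose "${answer}" instead of the expected "A"`);
  }
}
```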

Self-criticism

In self-criticism, the test first provides the application with an input and captures the output. The test then sends the initial input and the captured output back to the LLM, along with the rubric a human would use to evaluate the output’s accuracy. The test asks the AI to score the result on several dimensions (e.g., "On a scale of 1-10, how well does this image satisfy the above criteria?"). Essentially, you are building a simple generative AI application to test your generative AI application. Your prompt template should limit the AI to a narrow range of responses that your oracle can easily interpret, which means you should consider testing your tests, too.
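As a sketch, a self-criticism oracle can be as small as the function below. The rubric wording, model name, and passing threshold are illustrative assumptions; the important part is constraining the response to something machine-readable:

```ts
// self-criticism.ts — sketch of a rubric-based self-criticism oracle.
import OpenAI from 'openai';

const openai = new OpenAI();

const RUBRIC = `
Rubric: A passing logo is colorful (3+ distinct colors), family-friendly
(no violent or adult imagery), and clearly related to a petting zoo.
`;

export async function rubricScore(prompt: string, output: string): Promise<number> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    messages: [
      {
        role: 'user',
        content:
          `${RUBRIC}\nOriginal prompt: ${prompt}\nGenerated output: ${output}\n` +
          'On a scale of 1-10, how well does the output satisfy the rubric? ' +
          'Reply with only the number.',
      },
    ],
  });

  // Constrain the oracle to a narrow, machine-readable response.
  return Number(completion.choices[0].message.content?.trim());
}

// Example usage in a test:
//   expect(await rubricScore(prompt, output)).toBeGreaterThanOrEqual(7);
```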

As far as we’ve determined, you don’t increase the accuracy of the results by using different LLMs: we haven’t noticed a difference between OpenAI and Anthropic, and we haven’t used any others.

Technique 3: Making your test suite less flaky

Testing generative AI is a flaky business. The randomness of the outputs, as well as the integrations between the apps and the LLMs, create plenty of places for tests to fail. But there are ways to make results more reliable and consistent. While we can’t make them completely predictable, we can make them stable enough for automated testing to work smoothly.

Use seeds

A seed fixes the model’s random sampling so that the same prompt produces the same output every time. Your team can then use the seed to guarantee consistency across different test runs of the same model. You can also designate different seeds for different tests to evaluate the model’s performance across various input conditions.

When doing automated regression tests on generative AI tools, we’ve found it’s better to set global seeds in the model in a warmup phase rather than creating a new form input field that exposes the seed. Alternatively, teams could insert the seed value into the POST request as long as their API accepts it — that would allow the team to manage the seed without exposing a UI field for it. 
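A sketch of the POST-request approach is below. The endpoint and payload shape are hypothetical, and this only works if the application actually forwards the seed to the model provider:

```ts
// seeded-run.spec.ts — sketch of pinning a seed through the app's API.
import { test, expect } from '@playwright/test';

test('same seed and prompt produce the same output', async ({ request }) => {
  const payload = {
    prompt: 'Generate a logo concept for a petting zoo',
    seed: 42,        // fixed seed for reproducibility across runs
    temperature: 0,  // see "Lower the temperature" below
  };

  const first = await (await request.post('/api/generate', { data: payload })).json();
  const second = await (await request.post('/api/generate', { data: payload })).json();

  // With the seed (and temperature) pinned, repeated runs should match.
  expect(second.text).toEqual(first.text);
});
```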

Lower the temperature

Temperature is an LLM hyperparameter that controls the randomness, and therefore the diversity, of the generated output. Lowering the temperature makes the LLM’s response less likely to “hallucinate” or go off script, and less likely to change between runs. With the temperature lowered, your oracle can be more deterministic, making your automated tests less likely to fail. 

Technique 4: You can control costs without compromising product quality or delivery 

Continuous deployment necessitates frequent testing, preferably on every build. But unless your company has a money tree in the lobby, that strategy isn’t going to work for your generative AI app. As we’ve seen, generative AI tests will be slower and more expensive than your conventional end-to-end tests. To rein in the expense, you have to sacrifice a little bit of confidence; unless your organization can significantly increase its testing budget, you may not have much choice. If you do have to watch expenses, here are a few strategies to consider: 

Separate runs for deterministic and probabilistic SUTs

The overview briefly mentioned that your generative AI application will have features your team can test with conventional methods. It makes sense for the team to run those tests frequently, preferably on every check-in, as you’re hopefully already doing. 

For tests that exercise the models, try running them nightly or even weekly, depending on your cost and time sensitivity. You’ll sacrifice the instant feedback you get from the conventional end-to-end tests that run on each PR. Still, you probably don’t need to test the generative AI functionality unless the PR changes the underlying model.
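One way to express that split, sketched here with Playwright projects and a made-up `@generative` tag convention (the nightly schedule itself lives in whatever CI system you use):

```ts
// playwright.config.ts — sketch of separating conventional tests from
// model-exercising tests so they can run on different schedules.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      // Deterministic UI tests: cheap and fast, run on every check-in.
      name: 'deterministic',
      grepInvert: /@generative/,
    },
    {
      // Tests that exercise the model: run from a nightly or weekly CI job
      // with `npx playwright test --project=generative`.
      name: 'generative',
      grep: /@generative/,
    },
  ],
});
```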

Use spot check sampling

Instead of running all of the generative AI tests (that hopefully cover at least 80% of your application) with every check-in, run a random sample — say 20%. Over time, all your tests will get exercised, and those random samples will provide sufficient coverage.
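A quick-and-dirty sketch of spot-check sampling, assuming the generative specs live under a `tests/generative/` directory (adjust to your own layout):

```ts
// spot-check.ts — run a random ~20% sample of the generative suite.
import { execSync } from 'node:child_process';
import { readdirSync } from 'node:fs';

const specs = readdirSync('tests/generative').filter((f) => f.endsWith('.spec.ts'));

// Quick-and-dirty shuffle, then keep roughly 20% of the spec files.
const shuffled = [...specs].sort(() => Math.random() - 0.5);
const sample = shuffled.slice(0, Math.max(1, Math.ceil(specs.length * 0.2)));

const files = sample.map((f) => `tests/generative/${f}`).join(' ');
execSync(`npx playwright test ${files}`, { stdio: 'inherit' });
```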

Use intelligent sampling 

Based on what changed in the application, choose the 5–10 workflows most likely to be affected. If you find defects in that sample, continue expanding the test suite. 
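A sketch of one way to pick that sample automatically. The mapping from source paths to test tags is a hypothetical example; a real one would mirror your repository layout:

```ts
// intelligent-sample.ts — run only the workflows most likely to be affected.
import { execSync } from 'node:child_process';

// Map areas of the codebase to the test tags that cover them (hypothetical).
const AREA_TO_TAG: Record<string, string> = {
  'src/prompts/': '@prompting',
  'src/image-gen/': '@image-generation',
  'src/chat/': '@chat',
};

const changedFiles = execSync('git diff --name-only origin/main...HEAD')
  .toString()
  .split('\n')
  .filter(Boolean);

const tags = new Set<string>();
for (const file of changedFiles) {
  for (const [area, tag] of Object.entries(AREA_TO_TAG)) {
    if (file.startsWith(area)) tags.add(tag);
  }
}

if (tags.size > 0) {
  // Run the tagged workflows most likely to be affected by this change.
  execSync(`npx playwright test --grep "${[...tags].join('|')}"`, { stdio: 'inherit' });
}
```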

Roll out changes slowly

Limit the end-user exposure to any changes you make with slow and carefully monitored rollouts to production. Use A/B tests with user feedback to measure improvements or degradations in the model’s performance.

Questions? Let us know how we can help!

We hope this guide will serve you in your adventures into the vast uncharted territory of automated black-box testing for generative AI. The path ahead of you is paved with potential for hazard and reward. If you have a product using generative AI and want to learn more about how to test it, schedule some time with us.
