Generative AI Testing

Deterministic assertions for non-deterministic products and features

Trust the outputs of genAI features
Make sure features using genAI return consistent, relevant, and precise results — and pinpoint whether the prompt, agent, or model caused a regression.
Control compute costs for genAI testing
Avoid burning through tokens and your testing budget with selective execution and smart sampling techniques.
Maximize genAI testing coverage
  • AI-powered assertions
  • Token usage regression
  • Bias and fairness testing
  • Prompt template testing
  • Failure injection
  • Model testing
  • Model consistency testing
  • Invariance testing
  • Metamorphic testing
  • Unimodal or multimodal apps

Coverage you can rely on

There are several key challenges when it comes to automated testing of generative AI features:
  1. Generative AI is stochastic and doesn’t generate the same result every time. That makes it hard to define a “pass” or a “fail.”
  2. The number of test cases is effectively infinite. With some work you can constrain the randomness of genAI outputs, but you can’t define all the possible inputs a user may try.
  3. The underlying models are changing all the time. Even if there’s no change to your agents, the LLM you’re using could cause regressions.
Fortunately, QA Wolf has developed several novel techniques for testing generative AI applications and features. They use a mix of AI and strict determinism to ensure consistent results while keeping token usage from repetitive testing to a minimum.

How it’s done

Our testing techniques depend on the goals you have for your app and the type of output it responds with (text, artwork, or DOM-less canvas elements).
Depending on the test case, we may run the output back through an LLM with a detailed analysis prompt that reduces the evaluation to a deterministic pass-or-fail result.
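
As a sketch of that pattern, a test can hand the feature's output to a judge model with a rubric and force a PASS/FAIL answer, turning a fuzzy evaluation into a boolean assertion. The judgeOutput helper, rubric, and model below are illustrative (this assumes the OpenAI Node SDK), not QA Wolf's actual implementation:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical helper: ask a judge model a strict yes/no question about
// the app's output so the assertion itself is deterministic.
async function judgeOutput(output: string, rubric: string): Promise<boolean> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0, // minimize the judge's own randomness
    messages: [
      {
        role: "system",
        content: "You are a strict test oracle. Answer with exactly PASS or FAIL.",
      },
      {
        role: "user",
        content: `Rubric: ${rubric}\n\nOutput to evaluate:\n${output}`,
      },
    ],
  });
  return res.choices[0].message.content?.trim() === "PASS";
}

// Usage in a test: the app's answer may vary run to run, but the
// assertion reduces to a single boolean.
// expect(await judgeOutput(appAnswer, "Mentions the 30-day refund policy")).toBe(true);
```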

➔ Deductive assertions

With deductive assertions we can use context clues to programmatically assess the accuracy of an application’s AI-generated output.
A "golden master" is a known good result, taken from a previous successful run of a test. It may be acceptable to do a fuzzy match, where test would pass if the difference is within a specified percentage threshold.

➔ Structured data

The automated test will convert an AI-generated output to XML, SVG, or another structured data format, parse it, and compare it to a “golden master.”
Reducing randomness is particularly important when the AI-generated output is an input to a later part of the test case.
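
A sketch of what that comparison can look like for SVG output, assuming the jsdom library for parsing; the set of attributes worth comparing is illustrative and would be chosen per feature:

```ts
import { JSDOM } from "jsdom";

type ShapeSummary = { tag: string; fill: string | null };

// Reduce an AI-generated SVG to the structural facts the test cares
// about, ignoring the coordinate noise that varies between generations.
function summarizeSvg(svg: string): ShapeSummary[] {
  const doc = new JSDOM(svg, { contentType: "image/svg+xml" }).window.document;
  return Array.from(doc.querySelectorAll("rect, circle, path")).map((el) => ({
    tag: el.tagName,
    fill: el.getAttribute("fill"),
  }));
}

// Deterministic assertion: compare the parsed structure, not the raw text.
// expect(summarizeSvg(generatedSvg)).toEqual(summarizeSvg(goldenMasterSvg));
```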

➔ Seeding & temperature control

As appropriate, we will recommend adjustments to the application that reduce the randomness of the output.
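
With an OpenAI-style API, that typically means pinning temperature (and, where the provider honors it, a seed) on the generation call. A sketch, assuming the OpenAI Node SDK; parameter support varies by provider and model:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

async function generateSummary(notes: string) {
  // Pin the knobs that control randomness. `seed` is honored by some
  // providers/models on a best-effort basis; temperature 0 is the
  // portable lever for repeatability.
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    seed: 42,
    messages: [{ role: "user", content: `Summarize these release notes:\n${notes}` }],
  });
  return res.choices[0].message.content;
}
```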


Testing genAI for reliability & consistent outputs

Reliable AI needs reliable testing. QA Wolf builds, runs, and maintains generative AI test cases to catch hallucinations, ensure consistency, and verify outputs—so your AI delivers the right results every time.

Input validity

Input format validation

Verify that the generative AI system correctly processes various input formats (e.g., text, image, audio) and handles invalid inputs gracefully without crashing or producing errors.

Input length handling

Test that the AI model can handle inputs of varying lengths, including very short and very long inputs, and produce relevant, coherent outputs in each case.

Special characters handling

Check that the AI correctly processes inputs containing special characters, symbols, or non-standard Unicode characters.

Output consistency

Consistent output for same input

Validate that the AI generates consistent outputs for identical inputs over multiple runs.
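
One way to encode that check as a test, sketched here with the OpenAI Node SDK standing in for the feature under test (a real test would drive the app's UI or API instead):

```ts
import { test, expect } from "@playwright/test";
import OpenAI from "openai";

const openai = new OpenAI();

// Stand-in for the feature under test; in practice this would drive the
// app's UI or API rather than calling the model directly.
async function generate(prompt: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
}

test("identical input yields identical output across runs", async () => {
  const prompt = "List the three supported export formats.";
  const runs = await Promise.all([generate(prompt), generate(prompt), generate(prompt)]);
  for (const output of runs) {
    expect(output).toBe(runs[0]); // every run must match the first
  }
});
```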

Output coherence across sessions

Verify that outputs are coherent and contextually appropriate, and that they maintain continuity when given related inputs within a single session or context.

Stable output under minor variations

Assert that insignificant variations in input result in consistent output.
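
A sketch of an invariance test: paraphrased inputs should yield the same extracted fact. The extractPrice normalizer and the stubbed askApp are hypothetical stand-ins for the feature under test:

```ts
import { test, expect } from "@playwright/test";

// Hypothetical normalizer: pull out the one fact the test cares about so
// cosmetic wording differences can't fail the assertion.
function extractPrice(answer: string): string | null {
  return answer.match(/\$\d+(?:\.\d{2})?/)?.[0] ?? null;
}

test("paraphrased inputs yield the same extracted answer", async () => {
  const variants = [
    "How much is the Pro plan?",
    "What does the Pro plan cost?",
    "Pro plan price?",
  ];
  const answers = await Promise.all(variants.map((v) => askApp(v)));
  const prices = answers.map(extractPrice);
  expect(new Set(prices).size).toBe(1); // same fact regardless of phrasing
});

// Stand-in for a real UI/API interaction; stubbed here for illustration.
async function askApp(question: string): Promise<string> {
  return "The Pro plan costs $29.00 per month.";
}
```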

Model performance

Response time under load

Check that the AI model responds within acceptable time limits under various load conditions.

Accuracy of generated content

Validate the accuracy and relevance of generated content based on pre-defined benchmarks or sample outputs.

Resource utilization

Test that the AI model efficiently utilizes system resources (CPU, GPU, memory) without causing undue strain or bottlenecks during operation.

Security and privacy

Data encryption

Assert that all input and output data are encrypted during transmission and storage.

Access control

Verify that only authorized users can access the AI system and its generated outputs.

Data anonymization

Test that any personally identifiable information (PII) in the inputs or outputs is properly anonymized.
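
One lightweight way to probe for this in an automated test is to scan outputs for obvious PII patterns; the regexes below are illustrative and nowhere near exhaustive:

```ts
// Illustrative (and far from exhaustive) PII patterns.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/;
const US_PHONE = /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/;

// A test can assert that obvious PII never survives into the output.
function containsObviousPii(text: string): boolean {
  return EMAIL.test(text) || US_PHONE.test(text);
}

// expect(containsObviousPii(generatedAnswer)).toBe(false);
```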

Scalability

Horizontal scalability

Validate that the AI system can scale horizontally by adding more instances to handle increased load without degradation in performance.

Load balancing

Check that the system effectively distributes load across multiple servers or instances and maintains optimal performance under varying loads.

Concurrent user handling

Test that the AI model can handle a high number of concurrent users without significant performance drops or errors.

Compliance and ethical standards

Regulatory compliance

Validate that the AI system complies with applicable regulations and industry standards (e.g., GDPR, CCPA) in how it collects, processes, and outputs data.

Ethical guidelines adherence

Check that the system adheres to established ethical guidelines and avoids producing harmful, biased, or otherwise inappropriate content.

Transparency and explainability

Check that the AI system provides clear explanations for its outputs, enhancing transparency and trustworthiness in its decision-making process.

FAQs

What kinds of generative AI apps can you test?

We can test any generative AI app or LLM as long as you can give us access, either through a UI front-end or a back-end API. Once you set up your application in the environments best suited for your testing needs, we'll customize our tests to work with them.

How do you write deterministic assertions for non-deterministic outputs?

We have two main approaches to non-deterministic assertions: we can set the temperature on the model so it returns more predictable results, or we can pass the output to an AI that evaluates the model's responses. Our strategy depends on what we’re testing and what your team thinks is most important.

Can you test for bias?

Sure can! We automate "adversarial" tests that purposely introduce bias to check that your application doesn’t get tripped up. Remember, though, monitoring for bias is really a long-term game, best played in live production environments — a service we’re not offering just yet.

How do you protect sensitive data?

We use GCP Cloud SQL with AES-256 encryption for data at rest, and our system-to-system communication is safeguarded by TLS via Google Kubernetes Engine. But the best way to protect sensitive data when testing is to limit the tests’ access to it in the first place, unless the test is specifically exercising data security. We recommend that our customers take sensitive data off the table entirely: mask it, or better yet, go with synthetic data for testing.

What tools and frameworks do you use?

We use Microsoft Playwright for authoring tests. Where appropriate, we use the framework’s visual assertions in combination with our visual diffing tool to perform a pixel-by-pixel comparison against a known good image and return the percentage of detected change. It all runs on Kubernetes and Docker.
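
For instance, Playwright's built-in toHaveScreenshot assertion can express that kind of tolerance check directly; the URL, selector, and 2% threshold below are illustrative:

```ts
import { test, expect } from "@playwright/test";

test("generated artwork stays within 2% of the known good image", async ({ page }) => {
  await page.goto("https://app.example.com/generate"); // illustrative URL
  await page.getByRole("button", { name: "Generate" }).click();
  // Compares against a stored golden screenshot; fails if more than 2%
  // of pixels differ. The threshold is a placeholder, tuned per test.
  await expect(page.locator("#canvas")).toHaveScreenshot("artwork.png", {
    maxDiffPixelRatio: 0.02,
  });
});
```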

How do your tests fit into our CI/CD pipeline?

We can meet your team wherever they are, whether that’s scheduled runs, triggered runs from SCM like GitHub or GitLab, or API calls. We can run on ephemeral environments to validate individual PRs, and you can designate specific tests (or all of them) to be release blockers if they fail.

Do you test in production?

Since we’re a black-box testing service, we don’t have access to your production systems. We focus purely on what we can test from the outside.

How do we see test results?

We report the most critical information — whether the test suite passed and, if it didn’t, where the bugs are — through the messaging app, SCM, and issue tracker your devs are already in. You can get more detailed and historical information in the QA Wolf dashboard.
