We interact with Large Language Models (LLMs), the amazingly complex and mysterious tools powering everything from chatbots to QA, through a prompt. A prompt can be text, code, imagery, video, music, or any combination of media that the user gives to the LLM; the LLM, in turn, responds.
When we use an LLM as a tool to accomplish a task, our prompts have two components: variables and tasks. Variables provide information and context to the AI, while the task is what we want the AI to do with that material.
For example, many of us have copied an email into an AI and asked it to “sound more professional.” In this case, the email is the variable, and the task is making it sound more professional. And, as anyone who’s tried this can attest, the way your prompt is written has a huge effect on what the AI returns. Even small changes to a prompt can create large differences, sometimes positive, sometimes negative, and because LLMs are inherently non-deterministic, it’s hard to gauge how much better or worse any change will be.
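To make the variable/task split concrete, here is a minimal sketch in Python. The `call_llm` helper and the email text are placeholders for illustration, not any specific vendor’s API.

```python
# A prompt is just the variable (the material) plus the task (what to do with it).
EMAIL = "hey, just checking if u got my last msg about the invoice?? lmk"

TASK = "Rewrite the email below so it sounds more professional. Keep it brief."

# The full prompt combines both components.
prompt = f"{TASK}\n\nEmail:\n{EMAIL}"


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion client you use."""
    raise NotImplementedError("wire this to your LLM client of choice")


# Changing either component changes the output: swap in a different email
# (the variable) or reword the task, and the response can shift dramatically.
# response = call_llm(prompt)
```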
Crafting good prompts is challenging; you need high-quality inputs and clear tasks. But because AIs are so unpredictable, you need to test prompts extensively, and that testing complicates your development process. Teams traditionally use a pattern called the Golden Dataset to simplify that testing process. But Golden Datasets are a poor fit for today’s fast-paced generative AI projects. Those projects need a new approach that handles real-world data variability and enhances AI reliability: random sampling.
Golden Datasets are fixed collections of examples that undergo careful cleaning, labeling, and verification. Teams use them to make sure their applications perform well under controlled conditions. They are dependable, like AAA-rated bonds.
The idea behind a Golden Dataset is to provide a stable, reliable base, a benchmark, if you will, for evaluating models, which makes comparisons between models easy and consistent. By controlling data quality, as in a controlled experiment, teams get a clean, noise-free signal.
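As a rough illustration of how a Golden Dataset gets used, the sketch below scores a prompt against a fixed, hand-labeled set. The file name, record fields, and `grade` function are assumptions for the example, not part of any particular tool.

```python
import json


def load_golden(path: str = "golden.jsonl") -> list[dict]:
    """Load the hand-verified examples, one JSON object per line, e.g.
    {"input": "<raw email>", "expected": "<rewrite we signed off on>"}"""
    with open(path) as f:
        return [json.loads(line) for line in f]


def grade(output: str, expected: str) -> bool:
    # Placeholder check; real projects use rubric scoring or an LLM judge.
    return output.strip().lower() == expected.strip().lower()


def evaluate(prompt_template: str, call_llm) -> float:
    """Run every golden example through the prompt and report the pass rate."""
    examples = load_golden()
    passed = sum(
        grade(call_llm(prompt_template.format(input=ex["input"])), ex["expected"])
        for ex in examples
    )
    return passed / len(examples)
```

Because the examples never change, two prompt versions can be compared on exactly the same inputs, which is the clean signal the benchmark is meant to provide.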
In short, Golden Datasets are fine. But they have some problems.
Golden Datasets are popular because they work well for benchmarking and evaluating well-defined, easily reproducible problems. However, when we use them for things they aren’t suited to, they can make us miss better solutions and introduce biases that prevent our applications from handling the complexities they’ll face in the wild.
Limited adaptability to diverse, real-world inputs
Golden Datasets limit a model’s ability to handle diverse and evolving real-world inputs. Models trained on carefully selected, static data perform well in controlled settings but struggle with unexpected or messy data, leading to overfitting, where the model captures noise and dataset-specific patterns instead of generalizing to new data. And as production data and trends change, these models become outdated, or “drift,” the way a new road makes an old paper map obsolete.
Example: A model trained only on pull request reviews from one repository may excel there but fail with reviews from repositories that use different coding styles or conventions.
High cost and scaling challenges of curation and maintenance
Building Golden Datasets is time-consuming and expensive because they require manual curation and cleaning. Real-world conditions change often, so teams must keep updating the data to keep it relevant; by the time a Golden Dataset is finished, it may already be outdated because the prompts or input data have changed. Updating it creates a further complication: results on the new version can’t be compared directly with results on earlier versions, which makes it hard to track performance improvements over time.
This need for constant updates creates an inflexible and unscalable process for teams that must innovate and iterate quickly.
Potential for inherent biases
Even a carefully curated dataset can reflect the biases of its creators, skewing outcomes when models encounter unexpected inputs. Extending the pull request example above, if the dataset includes reviews written only by senior developers, the model might not handle the styles that junior developers use.
Random sampling pulls data blindly from a diverse, known source (for us, that source is our production environment) to capture a wide range of scenarios. Unlike a Golden Dataset, which is carefully curated and structured, this method doesn’t select for specific patterns or values. It’s like picking data out of a hat.
At QA Wolf, speed is crucial. Time constraints prevent us from building a representative dataset with every change. Instead, we use random sampling to collect data directly from production with minimal cleaning, using Helicone’s tools to gather a wide range of real-world data automatically.
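Here is a minimal sketch of that sampling step, assuming production LLM interactions have already been exported to a local JSONL file. The file name and record fields are illustrative; in practice the records come out of Helicone rather than a hand-rolled export.

```python
import json
import random


def sample_production_data(path: str = "production_requests.jsonl",
                           n: int = 200,
                           seed: int | None = None) -> list[dict]:
    """Pull n records blindly from an export of production LLM traffic.

    No curation and no filtering for "nice" examples: every logged request
    has the same chance of ending up in the evaluation set.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f]
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


# The sampled records feed the same evaluation loop we would otherwise have
# pointed at a Golden Dataset, but the inputs now reflect whatever production
# actually looked like this week.
# eval_set = sample_production_data(n=200)
```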
Random sampling gives us the inverse of each Golden Dataset drawback: our evaluation data reflects the diversity and drift of real production inputs, it costs very little to collect and refresh, and it carries far less of its curators’ bias.
Ultimately, we want to deliver an increasingly valuable product to our customers. For us, that means helping them get to 80% test coverage or higher as quickly as possible. QA Wolf’s use of random sampling reflects our commitment to building AI systems that don’t just work well under ideal conditions but also handle real-world complexity. By continuously adding diverse production data, we build AI agents that create and maintain tests in a system that learns, adapts, and improves over time without relying on static datasets or high maintenance costs.
This approach goes against the usual idea of controlling every variable and instead embraces the richness of real-world variability. Using methods that reflect real-world data's dynamic nature, we create more robust, fair, and effective AI solutions.
Improving AI isn’t about finding a one-size-fits-all solution but creating systems that can grow and adapt. It’s about seeing unpredictability as a chance to innovate. As we continue to develop AI, embracing real-world variability through random sampling will be key to unlocking AI’s full potential.