We interact with Large Language Models (LLMs), the amazingly complex and mysterious tools powering everything from chatbots to QA, through a prompt. A prompt can be text, code, imagery, video, music, or any combination of media that the user gives to the LLM; the LLM, in turn, responds.
When we use an LLM as a tool to accomplish a task, our prompts have two components: variables and tasks. Variables provide information and context to the AI, while the task is what we want the AI to do with that material.
For example, many of us have copied an email into an AI and asked it to “sound more professional.” In this case, the email is the variable, and the task is making it sound more professional. And, as anyone who’s tried this can attest, the way your prompt is written has a huge effect on what the AI returns. Even small changes to a prompt can create large differences, sometimes positive, sometimes negative, and because LLMs are inherently non-deterministic, it’s hard to gauge how much better or worse any change will be.
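To make the variable/task split concrete, here is a minimal sketch in Python. The `call_llm` helper and the email text are placeholders for illustration, not any specific vendor’s API.

```python
# A prompt is just the variable (the material) plus the task (what to do with it).
EMAIL = "hey, just checking if u got my last msg about the invoice?? lmk"

TASK = "Rewrite the email below so it sounds more professional. Keep it brief."

# The full prompt combines both components.
prompt = f"{TASK}\n\nEmail:\n{EMAIL}"


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion client you use."""
    raise NotImplementedError("wire this to your LLM client of choice")


# Changing either component changes the output: swap in a different email
# (the variable) or reword the task, and the response can shift dramatically.
# response = call_llm(prompt)
```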
Crafting good prompts is challenging; you need high-quality inputs and clear tasks. But because AIs are so unpredictable, you need to test prompts extensively, and that testing complicates your development process. Teams traditionally use a pattern called the Golden Dataset to simplify that testing process. But Golden Datasets are a poor fit for today’s fast-paced generative AI projects. Those projects need a new approach that handles real-world data variability and enhances AI reliability: random sampling.
Golden Datasets are fixed collections of examples that undergo careful cleaning, labeling, and verification. Teams use them to make sure their applications perform well under controlled conditions. They are dependable, like AAA-rated bonds.
The idea behind a Golden Dataset is to provide a stable, reliable base, a benchmark, if you will, for evaluating models, which makes comparisons between models easy and consistent. By controlling data quality, as in a controlled experiment, teams get a clean, noise-free signal.
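As a rough illustration of how a Golden Dataset gets used, the sketch below scores a prompt against a fixed, hand-labeled set. The file name, record fields, and `grade` function are assumptions for the example, not part of any particular tool.

```python
import json


def load_golden(path: str = "golden.jsonl") -> list[dict]:
    """Load the hand-verified examples, one JSON object per line, e.g.
    {"input": "<raw email>", "expected": "<rewrite we signed off on>"}"""
    with open(path) as f:
        return [json.loads(line) for line in f]


def grade(output: str, expected: str) -> bool:
    # Placeholder check; real projects use rubric scoring or an LLM judge.
    return output.strip().lower() == expected.strip().lower()


def evaluate(prompt_template: str, call_llm) -> float:
    """Run every golden example through the prompt and report the pass rate."""
    examples = load_golden()
    passed = sum(
        grade(call_llm(prompt_template.format(input=ex["input"])), ex["expected"])
        for ex in examples
    )
    return passed / len(examples)
```

Because the examples never change, two prompt versions can be compared on exactly the same inputs, which is the clean signal the benchmark is meant to provide.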
In short, Golden Datasets are fine. But they have some problems.
Golden Datasets are popular because they work well for benchmarking and evaluating well-defined, easily reproducible problems. However, when we use them for things they aren’t suited to, they can make us miss better solutions and introduce biases that prevent our applications from handling the complexities they’ll face in the wild.
Limited adaptability to diverse, real-world inputs
Golden Datasets limit a model’s ability to handle diverse and evolving real-world inputs. Models trained on carefully selected, static data perform well in controlled settings but struggle with unexpected or messy data, leading to overfitting, where the model captures noise and dataset-specific patterns instead of generalizing to new data. And as production data and trends change, these models become outdated, or “drift,” the way a new road makes an old paper map obsolete.
Example: A model trained only on pull request reviews from one repository may excel there but fail with reviews from repositories that use different coding styles or conventions.
High cost and scaling challenges of curation and maintenance
Building Golden Datasets is time-consuming and expensive because they require manual curation and cleaning. Real-world conditions change often, so teams must keep updating the data to keep it relevant; by the time a Golden Dataset is finished, it may already be outdated because the prompts or input data have changed. Updating it creates a further complication: results on the new version can’t be compared directly with results on earlier versions, which makes it hard to track performance improvements over time.
This need for constant updates creates an inflexible and unscalable process for teams that must innovate and iterate quickly.
Potential for inherent biases
Even a carefully curated dataset can reflect the biases of its creators, skewing outcomes when models encounter unexpected inputs. Extending the pull request example above, if the dataset includes reviews written only by senior developers, the model might not handle the styles that junior developers use.
Random sampling pulls data blindly from a diverse, known source (for us, that source is our production environment) to capture a wide range of scenarios. Unlike a Golden Dataset, which is carefully curated and structured, this method doesn’t select for specific patterns or values. It’s like picking data out of a hat.
At QA Wolf, speed is crucial. Time constraints prevent us from building a representative dataset with every change. Instead, we use random sampling to collect data directly from production with minimal cleaning, using Helicone’s tools to gather a wide range of real-world data automatically.
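Here is a minimal sketch of that sampling step, assuming production LLM interactions have already been exported to a local JSONL file. The file name and record fields are illustrative; in practice the records come out of Helicone rather than a hand-rolled export.

```python
import json
import random


def sample_production_data(path: str = "production_requests.jsonl",
                           n: int = 200,
                           seed: int | None = None) -> list[dict]:
    """Pull n records blindly from an export of production LLM traffic.

    No curation and no filtering for "nice" examples: every logged request
    has the same chance of ending up in the evaluation set.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f]
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


# The sampled records feed the same evaluation loop we would otherwise have
# pointed at a Golden Dataset, but the inputs now reflect whatever production
# actually looked like this week.
# eval_set = sample_production_data(n=200)
```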
Random sampling gives us the inverse of each Golden Dataset drawback: our evaluation data reflects the diversity and drift of real production inputs, it costs very little to collect and refresh, and it carries far less of its curators’ bias.
Ultimately, we want to deliver an increasingly valuable product to our customers. For us, that means helping them get to 80% test coverage or higher as quickly as possible. QA Wolf’s use of random sampling reflects our commitment to building AI systems that don’t just work well under ideal conditions but also handle real-world complexity. By continuously adding diverse production data, we build AI agents that create and maintain tests in a system that learns, adapts, and improves over time without relying on static datasets or high maintenance costs.
This approach goes against the usual idea of controlling every variable and instead embraces the richness of real-world variability. Using methods that reflect real-world data's dynamic nature, we create more robust, fair, and effective AI solutions.
Improving AI isn’t about finding a one-size-fits-all solution but creating systems that can grow and adapt. It’s about seeing unpredictability as a chance to innovate. As we continue to develop AI, embracing real-world variability through random sampling will be key to unlocking AI’s full potential.