Two properties of generative AI applications make them particularly challenging to test.
The first is that they are computationally expensive, often requiring high-performance hardware and substantial energy to operate effectively in real time.
The second is that they return stochastic results, so it’s difficult to assign blame to the model, the related systems, or even the user, when the expected and actual results don’t line up.
Given these challenges, testers (be they developers, QA engineers, or QA Wolves) will need to change how they work and what they focus on. It’ll take careful planning and teamwork. We’ve been watching the transformation play out as more and more of our customers introduce generative AI features into their products, so in this post we’d like to share what we think is going to be new, what’s going to change, and what’s going to stay the same.
If you’ve been a member of a software development organization of any size, you are probably familiar with the integration testing approach known as big-bang: basically, the team tries to test everything all at once in black-box, end-to-end style. It sounds great because it’s quick, requires less technical expertise, and theoretically doesn’t cost as much upfront as writing white-box tests. The important word here is theoretically.
With generative AI testing, there is no theory to argue: trying to test everything through the front end with the big-bang approach is going to be far more expensive than white-box testing. That’s not only because, as we explained above, every call to the model is computationally (and therefore monetarily) expensive, but also because these applications need a much larger number of test runs to prove they behave consistently. So, in the interest of saving your organization money, your team will need to pay a lot more attention to white-box unit and component integration testing, and in particular to mocking the model when you’re not explicitly testing it and keeping large queries and responses to a minimum.
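Here’s a rough sketch of what that mocking can look like in a component test. We’re using Vitest here, and the LlmClient interface and generateSummary function are hypothetical stand-ins for your own orchestration code:

```ts
// summary.test.ts — a minimal sketch of mocking the model in a component test.
// `LlmClient` and `generateSummary` are hypothetical stand-ins for your own orchestrator code.
import { describe, it, expect, vi } from "vitest";

// The interface the orchestrator depends on, so the real model can be swapped out in tests.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// A hypothetical orchestrator function: builds a prompt and post-processes the model's reply.
async function generateSummary(client: LlmClient, article: string): Promise<string> {
  const prompt = `Summarize the following article in one sentence:\n\n${article}`;
  const raw = await client.complete(prompt);
  return raw.trim();
}

describe("generateSummary", () => {
  it("builds the prompt and cleans up the response without calling the live model", async () => {
    // A mock client returns a canned response, so the test is fast, free, and deterministic.
    const mockClient: LlmClient = { complete: vi.fn().mockResolvedValue("  A short summary.  ") };

    const result = await generateSummary(mockClient, "Some long article text...");

    expect(mockClient.complete).toHaveBeenCalledWith(
      expect.stringContaining("Summarize the following article")
    );
    expect(result).toBe("A short summary.");
  });
});
```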
That increase in white-box coverage also needs to be accompanied by another shift.
Testers and developers have traditionally stayed out of each other’s business, allowing each group’s testing efforts to develop independently. It makes sense: after all, if you know what areas other people are testing, you may be inclined to cut back your own efforts in those areas, and some test cases end up getting missed.
Since the cost of testing skyrockets when dealing with generative AI, companies can’t afford to isolate their testers from the rest of the engineering organization in the name of testing objectivity. Keeping white-box testing in Development and black-box testing in QA creates too many duplicative test cases, and only the largest organizations in the world will be able to afford them.
Teams need to know two things about their model to determine the scope of the testing: 1) whether the model is proprietary or off-the-shelf, like GPT-4, and 2) whether the model uses the input for training purposes. If you don’t understand who developed the model, you risk duplicating some testing effort, either that of your internal team or, in the case of third-party models, that of the model provider. And remember, any duplication in generative AI testing can get expensive fast.
In-house models come with significantly more complexity and require a lot more testing. The model development team tests the ML algorithm and the hyperparameter configuration, but someone on the team also needs to test the training data for completeness. Model data testing makes sure the training data adequately represents the reality you’re trying to generalize from, and that your team hasn’t oversimplified it or inadvertently introduced informational or human-induced bias.
That said, most teams aren’t developing models in-house but instead rely on third-party models, which require less testing: primarily verifying that the orchestrator component sends queries to the model correctly and behaves appropriately when it receives an unexpected response. Still, we’ll cover both model types here, because there are teams that develop their own models or even combine the two approaches.
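As a sketch of the third-party case, here’s one way to check that an orchestrator degrades gracefully when the model returns something it didn’t expect (classifyTicket and its fallback behavior are hypothetical stand-ins for your own component):

```ts
// orchestrator.test.ts — a sketch of testing how the orchestrator handles an unexpected model response.
// `classifyTicket` and its fallback behavior are hypothetical; substitute your own component.
import { describe, it, expect } from "vitest";

interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// Hypothetical orchestrator: asks the model for JSON and falls back to "unknown" when parsing fails.
async function classifyTicket(client: LlmClient, ticket: string): Promise<string> {
  const raw = await client.complete(
    `Classify this support ticket as JSON {"category": "..."}:\n${ticket}`
  );
  try {
    const parsed = JSON.parse(raw);
    return typeof parsed.category === "string" ? parsed.category : "unknown";
  } catch {
    return "unknown"; // Never surface a raw parsing error to the user.
  }
}

describe("classifyTicket", () => {
  it("returns the category when the model behaves", async () => {
    const client: LlmClient = { complete: async () => '{"category": "billing"}' };
    expect(await classifyTicket(client, "I was charged twice")).toBe("billing");
  });

  it("degrades gracefully when the model returns something unexpected", async () => {
    const client: LlmClient = { complete: async () => "Sorry, I can't help with that." };
    expect(await classifyTicket(client, "I was charged twice")).toBe("unknown");
  });
});
```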
In addition, applications that feed user input back into the model for real-time training or adaptation have different testing and development requirements than those that don’t.
For applications that do, it’s essential to test both the underlying model’s ability to learn and adapt dynamically and the robustness and scalability of the data-handling mechanisms, such as vector databases, used to store and retrieve this information efficiently. These systems may also need tests confirming that model updates based on new input are accurate and effective.
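Here’s a minimal sketch of what a retrieval-quality test might look like, assuming a toy in-memory vector store; in practice you’d point the same kind of test at your real vector database and precomputed embeddings:

```ts
// retrieval.test.ts — a sketch of checking retrieval quality for the data layer behind a RAG feature.
// The in-memory store below is illustrative, not any vendor's API.
import { describe, it, expect } from "vitest";

type Doc = { id: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// A toy store; a real test would query your actual vector database with fixture embeddings.
class InMemoryVectorStore {
  private docs: Doc[] = [];
  add(doc: Doc) { this.docs.push(doc); }
  query(embedding: number[], topK: number): Doc[] {
    return [...this.docs]
      .sort((x, y) => cosineSimilarity(y.embedding, embedding) - cosineSimilarity(x.embedding, embedding))
      .slice(0, topK);
  }
}

describe("retrieval layer", () => {
  it("returns the most relevant document first", () => {
    const store = new InMemoryVectorStore();
    store.add({ id: "refund-policy", embedding: [0.9, 0.1, 0.0] });
    store.add({ id: "shipping-times", embedding: [0.1, 0.9, 0.0] });

    // Embedding for a query like "how do I get my money back", precomputed in a real test fixture.
    const results = store.query([0.85, 0.15, 0.0], 1);
    expect(results[0].id).toBe("refund-policy");
  });
});
```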
On the other hand, teams whose applications don’t use user input for immediate model training should thoroughly test the static parts of the pipeline, such as the prompt templates and the consistency of the model’s responses. That includes making sure the prompts produce the intended actions or responses from the model and that the model’s outputs remain reliable and high-quality over time.
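Prompt-template tests can be as simple as pinning down the static text your application depends on. Here’s a sketch, with renderSupportPrompt standing in for whatever template code your app actually uses:

```ts
// prompt-template.test.ts — a sketch of pinning down a prompt template's static behavior.
// `renderSupportPrompt` is a hypothetical template function; adapt it to your own prompt code.
import { describe, it, expect } from "vitest";

function renderSupportPrompt(params: { productName: string; question: string }): string {
  return [
    `You are a support assistant for ${params.productName}.`,
    "Answer in three sentences or fewer and never invent features.",
    `Customer question: ${params.question}`,
  ].join("\n");
}

describe("renderSupportPrompt", () => {
  it("interpolates every placeholder", () => {
    const prompt = renderSupportPrompt({ productName: "Acme CRM", question: "How do I export contacts?" });
    expect(prompt).toContain("Acme CRM");
    expect(prompt).toContain("How do I export contacts?");
    expect(prompt).not.toMatch(/\$\{|undefined/); // no leaked template syntax or missing values
  });

  it("keeps the guardrail instructions that downstream behavior depends on", () => {
    const prompt = renderSupportPrompt({ productName: "Acme CRM", question: "anything" });
    expect(prompt).toContain("never invent features");
  });
});
```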
Unlike traditional applications, where InfoSec might focus more on perimeter defense and rule-based compliance, generative AI applications need proactive simulation of potential threats and misuse scenarios. Because these systems generate outputs dynamically based on the inputs they receive, their behavior is less predictable and more susceptible to exploitation in unique and unforeseen ways. Testers’ creativity and technical insight are crucial for foreseeing complex attack vectors that might not be immediately apparent. By simulating these threats proactively, organizations can identify and mitigate potential vulnerabilities early in the development process, ensuring the AI system is robust against both current and future security threats.
When a team performs security testing on an application that uses an in-house model, they are testing the integrity of the data while also validating that the app protects that data. With applications using off-the-shelf models, teams might rely on third-party security measures. In contrast, in-house models offer the flexibility to implement custom security protocols tailored to specific needs. In either case, you’ll want to test that data, particularly sensitive information, is encrypted and handled securely in transit and at rest. Additionally, with in-house models, you’ll want to test security in the training and inference phases with adversarial testing, penetration testing, or related techniques.
When working with a third-party model, if it’s critical that the data you pass to the model isn’t used to train it, your team will likely want to put safeguards in place, and those safeguards need to be tested. It doesn’t matter who does the testing, but someone needs to make sure your team has coverage.
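One way to test that kind of safeguard is to capture the outbound request with a fake HTTP layer and assert that sensitive values were redacted and the provider’s opt-out mechanism was used. In the sketch below, sendToModel, the redaction rule, and the x-no-training header are all hypothetical; use whatever opt-out mechanism your provider actually documents:

```ts
// safeguards.test.ts — a sketch of testing data-handling safeguards before a prompt leaves your system.
// `sendToModel`, the redaction rule, and the "x-no-training" header are hypothetical; rely on
// whatever opt-out mechanism your model provider actually documents.
import { describe, it, expect, vi } from "vitest";

type HttpPost = (url: string, body: unknown, headers: Record<string, string>) => Promise<unknown>;

// Hypothetical wrapper: redacts email addresses and flags the request as excluded from training.
async function sendToModel(post: HttpPost, prompt: string): Promise<unknown> {
  const redacted = prompt.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[REDACTED_EMAIL]");
  return post("https://model.example.com/v1/complete", { prompt: redacted }, { "x-no-training": "true" });
}

describe("sendToModel safeguards", () => {
  it("redacts sensitive data and sets the opt-out flag", async () => {
    const post = vi.fn().mockResolvedValue({ ok: true });

    await sendToModel(post, "Customer jane.doe@example.com cannot log in");

    const [, body, headers] = post.mock.calls[0];
    expect(JSON.stringify(body)).not.toContain("jane.doe@example.com");
    expect(JSON.stringify(body)).toContain("[REDACTED_EMAIL]");
    expect(headers["x-no-training"]).toBe("true");
  });
});
```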
Generative AI apps also add a new dimension to how teams need to think about performance testing.
Generative AI applications, particularly those utilizing language models, often use a token-based system for processing input and generating output. The number of tokens processed can significantly affect performance and costs. If you're using a third-party model, chances are that your team will perform the bulk of the testing on the prompt template, and prompt templates can grow quickly, depending on what you are doing. In-house models allow for custom optimization of token usage and performance tuning to balance cost and efficiency.
In addition to watching token usage, teams need rigorous testing to identify bottlenecks when processing large or complex requests so they can scale resources accordingly.
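One inexpensive guardrail is a test that keeps the rendered prompt’s token footprint under a budget. The sketch below uses a crude characters-per-token approximation and an illustrative budget; swap in a real tokenizer library for accurate counts:

```ts
// token-budget.test.ts — a sketch of keeping a prompt template's token footprint (and cost) in check.
// The 4-characters-per-token heuristic is a rough approximation; swap in a real tokenizer library
// for accurate counts. The budget below is illustrative.
import { describe, it, expect } from "vitest";

const PROMPT_TOKEN_BUDGET = 1500;

function approximateTokenCount(text: string): number {
  return Math.ceil(text.length / 4); // crude rule of thumb for English text
}

// Hypothetical template that stuffs retrieved context into the prompt.
function renderAnswerPrompt(context: string[], question: string): string {
  return `Use only the context below to answer.\n\nContext:\n${context.join("\n---\n")}\n\nQuestion: ${question}`;
}

describe("prompt token budget", () => {
  it("stays under budget even with a worst-case amount of retrieved context", () => {
    const worstCaseContext = Array.from({ length: 5 }, (_, i) => `Document ${i}: ${"x".repeat(800)}`);
    const prompt = renderAnswerPrompt(worstCaseContext, "What is the refund window?");

    expect(approximateTokenCount(prompt)).toBeLessThanOrEqual(PROMPT_TOKEN_BUDGET);
  });
});
```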
Testing for bias and ethics isn’t something people have to worry about with traditional apps, where inappropriate material can only show up if someone codes it directly into the application. With generative AI applications, questionable or inaccurate material can be generated out of nowhere. That makes spotting and fixing biases and ethics violations tough, but testers will need to spend time learning how to do it.
The potential damage if they don’t can be significant: from reinforcing old stereotypes in a generated picture to making the wrong diagnosis from health information. What’s more, depending on your application, the underlying AI model can keep updating behind the scenes even after you think it’s done, which means it could pick up new biases on the fly if your team isn’t careful.
To do this type of testing on the model, teams can use evaluation data sets specifically designed to surface bias and ethics problems. But think again if you believe your team is off the hook because your app uses an off-the-shelf model and someone has already done this testing for you. Many off-the-shelf models have rules around things like illegal activity and self-harm that will trigger moderator review, along with policies stating that such conduct may be reported to the authorities. You don’t want that to happen, so the team needs to understand the model’s code of ethics and test that the application doesn’t send over prompts that violate it (preferably without using the live model).
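One way to do that is a pre-flight check that screens prompts against the provider’s usage policy before anything reaches the live model. The sketch below is deliberately simplistic, just a short list of regular expressions; real teams typically maintain a curated set of policy-violating test prompts:

```ts
// policy-check.test.ts — a sketch of screening prompts against a provider's usage policy
// before they ever reach the live model. The pattern list here is deliberately simplistic.
import { describe, it, expect } from "vitest";

const DISALLOWED_PATTERNS: RegExp[] = [
  /\bself[- ]harm\b/i,
  /\bcounterfeit (money|currency)\b/i,
];

// Hypothetical pre-flight guard the orchestrator calls before sending a prompt upstream.
function violatesProviderPolicy(prompt: string): boolean {
  return DISALLOWED_PATTERNS.some((pattern) => pattern.test(prompt));
}

describe("provider policy pre-flight check", () => {
  it("blocks prompts the provider would flag", () => {
    expect(violatesProviderPolicy("How do I counterfeit money?")).toBe(true);
  });

  it("lets ordinary prompts through", () => {
    expect(violatesProviderPolicy("Summarize my last three support tickets")).toBe(false);
  });
});
```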
Production monitoring is not traditionally considered a testing activity. However, as more teams have moved to continuous delivery and learned to release fixes in a matter of seconds, they have found that a robust production testing strategy helps them deliver features while meeting their customers’ expectations for quality.
In this way, production monitoring is often part of the testing process in the generative AI app world. Because LLMs return stochastic results, the best way to determine whether they behave consistently over time is to observe them across a wide range of inputs. There’s simply not enough time to perform all of that testing proactively and remain a viable business.
Furthermore, continuous monitoring in production allows app developers to detect and respond to issues in real time. Not only do generative AI applications produce unpredictable outputs, but their fidelity can degrade over time as the context and data evolve.
Monitoring off-the-shelf models involves tracking usage metrics, error rates, and user feedback. For in-house models, the monitoring can be more granular, including model drift, data quality issues, and the impact of incremental training updates. Production monitoring must be designed to quickly identify and mitigate issues affecting the user experience or leading to unethical outcomes. Furthermore, teams use production monitoring in conjunction with release strategies such as A/B testing, which allows them to introduce changes to the model in a targeted way until they determine the model is ready for general availability.
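As a sketch of what that monitoring might look like in code, here’s a lightweight wrapper that records latency, token usage, and a quality score for each model call and flags drift when the rolling average dips below a threshold. The metrics, scorer, and threshold are all illustrative:

```ts
// model-monitor.ts — a sketch of lightweight production monitoring around model calls.
// Metric names, the quality scorer, and the drift threshold are illustrative; in practice
// these records would flow into your observability stack rather than an in-memory array.
interface ModelCallRecord {
  latencyMs: number;
  outputTokens: number;
  qualityScore: number; // e.g., from an automated evaluator or user feedback, scaled 0 to 1
}

class ModelMonitor {
  private records: ModelCallRecord[] = [];

  constructor(private driftThreshold = 0.7, private windowSize = 100) {}

  record(entry: ModelCallRecord): void {
    this.records.push(entry);
    if (this.records.length > this.windowSize) this.records.shift();
  }

  // Flags drift when the rolling average quality score drops below the threshold.
  isDrifting(): boolean {
    if (this.records.length === 0) return false;
    const avg = this.records.reduce((sum, r) => sum + r.qualityScore, 0) / this.records.length;
    return avg < this.driftThreshold;
  }
}

// Usage: wrap each model call, then alert (or page someone) when isDrifting() turns true.
const monitor = new ModelMonitor();
monitor.record({ latencyMs: 820, outputTokens: 240, qualityScore: 0.9 });
console.log(monitor.isDrifting()); // false
```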
Developers love black-box testing, and not just because they are usually not the ones who have to do it. Black-box testing increases overall confidence in the product by giving teams a window into how the customer will experience the application. Generative AI changes the amount of emphasis teams place on black-box testing but doesn’t change the need for it. Teams will still need a battery of black-box tests, preferably automated, to make sure their application is production-ready.
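For example, here’s a sketch of an automated black-box test for a chat-style feature, written with Playwright. The URL and selectors are hypothetical, and because the model’s output is stochastic, the assertions check structure and guardrails rather than exact wording:

```ts
// chat.spec.ts — a sketch of an automated black-box test for a chat-style generative AI feature.
// The URL and selectors are hypothetical. Because the model's output is stochastic, the assertions
// check structure and guardrails rather than exact wording.
import { test, expect } from "@playwright/test";

test("assistant answers a product question without leaking internal errors", async ({ page }) => {
  await page.goto("https://app.example.com/assistant");

  await page.getByPlaceholder("Ask a question").fill("How do I export my contacts?");
  await page.getByRole("button", { name: "Send" }).click();

  const reply = page.getByTestId("assistant-reply").last();
  await expect(reply).toBeVisible({ timeout: 30_000 });

  const text = (await reply.textContent()) ?? "";
  expect(text.length).toBeGreaterThan(20); // a substantive answer, not an empty bubble
  expect(text).not.toMatch(/traceback|exception|as an ai language model/i); // no leaked errors or boilerplate
});
```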
We would not encourage anyone to go off half-cocked and do some ad hoc testing on a generative AI app without understanding it. We’ve seen testers report “bugs” in situations like this, only to find out they were doing it wrong and end up with egg on their faces.
With generative AI apps, exploratory testing becomes a little more technical and in the weeds. Exploratory testers may need to try their hand at tweaking the prompts or preparing the right contextual environment to test the prompt.
In other words, people who are good at finding bugs are still very much needed.
Some of the most prized testers on the team are those who are good at thinking outside the box and coming up with probable, impactful scenarios that no one else thought of. That’s not going to go away with generative AI. We think that kind of creativity, enhanced by a little functional knowledge of how generative AI apps do what they do, is a winning combination for the future of anyone looking to test these applications.
We think it’s time to re-evaluate testing in the world of generative AI. Testers were once encouraged to put blinders on to make themselves better testers, but that siloed worldview has to disappear. Instead, teams need to expand the scope of testing to include non-functional areas, ethics, bias, and production monitoring. They need to adjust their definitions of security and performance testing. Most importantly, they need to create a tight collaboration between developers and testers and evaluate how to create test coverage for as much of the application as possible without bankrupting the test organization.
At QA Wolf, we’ve been performing automated black-box testing on generative AI applications for quite some time now. Reach out and we can help you plan your coverage so your team can focus on what they do best.