Three Principles for Building Multi-Agent AI Systems
We redefine automated test maintenance by using specialized bots for accuracy and efficiency. Here’s how our agents apply that to deliver reliable QA testing.

Automated test maintenance is a notorious time sink—complex, costly, and error-prone. What if AI could do the heavy lifting? Well, that’s what we’ve done. But when we first started, we ran into some issues.

We initially built our AI agent to make independent decisions and take actions based on the state of the test. However, our agent quickly became overwhelmed with managing the entire test maintenance process — from spotting issues to applying fixes and re-enabling affected tests. The reason was that our single-agent approach was a jack-of-all-trades and master of none. The more tasks this single agent had to juggle, the slower and more prone to mistakes it became.

Recognizing this, we rewrote our codebase from scratch using a multi-agent approach. By dividing the workload among multiple specialized agents, each handling a specific task, we improved efficiency and reduced errors. This modular approach allowed each agent to become an expert in its domain, leading to faster and more reliable test maintenance.

You can think about a multi-agent system like a restaurant kitchen where each chef specializes in one dish—one focuses on the perfect steak, another on delicate pastries. Each works in harmony, delivering a flawless meal. That’s how our multi-agent system operates—each agent is a master of its task, ensuring precision and speed. While we can’t change the inherently finicky nature of automated E2E tests, this system allows us to solve those problems more effectively.

The complexity of test maintenance requires innovative solutions

Maintaining software tests is an often underestimated part of the testing lifecycle. As software evolves, tests must adapt to new functionality, interface updates, shifting requirements, and environmental changes. Yet the maintenance process itself is riddled with challenges, which makes automating it even harder.

Why E2E test maintenance is complicated

Apart from technical and logistical challenges, maintaining automated tests is tough because it's a long process with many steps, over half of which require more than just following instructions—they need context gathering and good judgment.

When a test fails, it can be hard to figure out why. The problem could be in the test, the app, or just a temporary issue with the environment. To figure it out, you need to understand the app, the environment, and the test, including what happened when it ran. Once you have that info, you make a decision, test it, and if it doesn't work, try again.

Similarly, updating tests as your app changes isn't just about making technical tweaks. You need to understand how those changes affect other tests and make the right choices to keep everything running smoothly.

Why it's hard for AI to maintain E2E tests

AI decisions come in two types: one-shot and sequential. One-shot decisions are where you make a single decision based on available information, like when you stop at a red light. However, to maintain a test, AI needs to perform sequential decision-making, which is more complex. For example, if a test fails, you need to investigate the cause, consider what might have happened under different conditions, try a solution, and then check if it works. Complicating matters, test flakes happen, and discerning between flakes and bugs requires a lot of context.
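
To make the contrast concrete, here is a minimal sketch of the two decision types. Every name in it (`oneShot`, `maintainTest`, `chooseFix`, `runTest`) is hypothetical and stands in for whatever a real system would use. The point is only that the sequential loop feeds the outcome of each attempt into the next decision.

```typescript
// All names here are hypothetical and only illustrate the idea.
type Verdict = "pass" | "fail";

interface Attempt {
  fix: string;
  verdict: Verdict;
}

// One-shot: a single decision from the information at hand.
function oneShot(lightColor: string): "stop" | "go" {
  return lightColor === "red" ? "stop" : "go";
}

// Sequential: each decision depends on the outcome of every earlier step.
function maintainTest(
  chooseFix: (history: Attempt[]) => string | null,
  runTest: (fix: string) => Verdict,
  maxAttempts: number
): Attempt[] {
  const history: Attempt[] = [];
  while (history.length < maxAttempts) {
    const fix = chooseFix(history); // sees everything tried so far
    if (fix === null) break; // nothing left to try
    const verdict = runTest(fix);
    history.push({ fix, verdict });
    if (verdict === "pass") break; // stop once the test passes
  }
  return history;
}

// Usage with stand-in functions: the second candidate fix makes the test pass.
const candidateFixes = ["add wait", "update selector", "rewrite step"];
const attempts = maintainTest(
  (h) => (h.length < candidateFixes.length ? candidateFixes[h.length] : null),
  (fix) => (fix === "update selector" ? "pass" : "fail"),
  5
);
```

In this toy run the loop stops after two attempts, because the second fix passes. A real maintainer would also have to decide whether a failure was a flake or a bug, which this sketch leaves out.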

Teaching AI to maintain tests is challenging because, while AI can follow rules and recognize patterns, it doesn't inherently understand the application or the purpose behind a change. This lack of understanding makes it difficult for AI to figure out why a test failed or how to update tests when the application changes. While AI can help with some parts of the process, it struggles with tasks requiring nuanced understanding and deeply sequential decision-making.

How did QA Wolf solve E2E test maintenance using AI?

We designed our multi-agent system to create and maintain end-to-end tests, emphasizing speed and accuracy of fixes.

Optimizing test maintenance with intelligent history summarization

A critical component in maintaining a test is history. Past events in our system are crucial for identifying solutions to current failures. In complex systems like ours, focusing solely on the current state isn't enough—it's like cooking a meal and instantly forgetting what you have done so far, like whether you’ve added salt. Anyone who has seen the movie “Memento” would understand the problem.

One option is keeping track of everything that’s happened so far, but this gets messy quickly. Many of our sessions involve more than 50 actions—from identifying a failure to attempting a fix—and persisting the entire action history incurs significant performance overhead, leads to data redundancy that can degrade system accuracy, and results in substantial storage and processing costs.

Our solution is to create intelligent summaries of past events. Instead of remembering every detail, each part of the system gives a quick, clear summary of what happened. By being smart about how we format and handle these summaries, we keep the system running smoothly without getting bogged down.
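
As a rough illustration of the idea, the sketch below keeps one short line per past event and leaves the raw payload out. The `ActionRecord` shape and the `summarizeHistory` function are assumptions made for this example; the article does not describe the actual summary format.

```typescript
// Hypothetical record shape; the real format is not described in this article.
interface ActionRecord {
  node: string; // which node ran
  summary: string; // short description written by that node
  rawOutput: string; // full logs, HTML, and so on (can be very large)
}

// Build the history that later steps see: one short line per past event,
// with the raw payload left out.
function summarizeHistory(records: ActionRecord[]): string[] {
  return records.map((r, i) => `${i + 1}. [${r.node}] ${r.summary}`);
}

const records: ActionRecord[] = [
  {
    node: "runTest",
    summary: "Test failed at the date-picker step",
    rawOutput: "<html>...</html>",
  },
  {
    node: "updateSelector",
    summary: "Replaced the date-picker selector",
    rawOutput: "diff ...",
  },
];
const historyLines = summarizeHistory(records);
```

Because each node writes its own summary, the history grows by one short line per action rather than by the size of that action's full output.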

Understanding the current state with a policy-driven approach

Armed with an intelligent summary of past results, we use what’s called a “policy function” to decide what to do next. Our system evaluates the current state: all the relevant information at a given moment that a QA engineer would consider when troubleshooting a broken test. That includes the error message, the code that caused it, the HTML both at the time of the error and at present, the visible code, the page title and URL, and the history of actions taken.
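
One way to picture this is as a typed state object and a function from that state to the next node. The field names and the `Policy` type below are illustrative assumptions that mirror the list above, not the real data model. The stub policy is hard-coded only so the example runs; in the system described here, that decision comes from an AI model.

```typescript
// Field names are illustrative; they mirror the list above.
interface MaintenanceState {
  errorMessage: string;
  failingCode: string;
  htmlAtError: string;
  currentHtml: string;
  visibleCode: string;
  pageTitle: string;
  url: string;
  history: string[]; // summaries of the actions taken so far
}

// A policy maps the current state to the name of the next node to run.
type Policy = (state: MaintenanceState) => string;

// Stand-in policy for demonstration only; node names are hypothetical.
const stubPolicy: Policy = (state) =>
  state.history.length === 0 ? "inspectFailure" : "runTest";

const initialState: MaintenanceState = {
  errorMessage: "strict mode violation: locator resolved to 2 elements",
  failingCode: "await page.locator('.date').click();",
  htmlAtError: '<div class="date"></div><div class="date"></div>',
  currentHtml: '<div class="date"></div><div class="date"></div>',
  visibleCode: "await page.locator('.date').click();",
  pageTitle: "Booking",
  url: "https://example.com/booking",
  history: [],
};
const firstDecision = stubPolicy(initialState);
```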

The role of nodes in our multi-agent system

In our multi-agent system, the relationships between nodes form the backbone of maintaining and repairing failing end-to-end tests. Each node plays a distinct role, contributing to the overall decision-making process, even though they all ultimately serve the policy function that guides the system's actions.

[Figure: Example of our policy-driven approach]

Agents

Agent nodes are the starting point. Agents are the primary decision-makers within the system, similar to chefs in a kitchen. They autonomously interact with other nodes, moving between tasks, invoking specific actions, or handing off responsibilities to another agent. For example, our Date Picker Maintainer agent is responsible for fixing any date-picker issues in tests, such as selector problems or Playwright strict mode violations. This agent evaluates the situation, decides on the necessary action, and either resolves the issue or coordinates with other nodes to continue the process.

Actions

Agents call action nodes to get the actual work done—these are the only nodes that directly modify the state. Imagine this as the cooking process in a kitchen, where ingredients are transformed into a finished dish. In our system, the action nodes execute the essential tasks, such as running Playwright code to determine whether a test passes or fails. These nodes are crucial for actively coding and executing tests, directly altering the state to achieve the desired outcome.

Tools

Agents call tool nodes to provide data or resources. Tools are similar to actions, except that tools don’t alter the state, in the same way a kitchen scale measures ingredients without changing them. These nodes fetch necessary information, such as how many times a currently failing test has passed or failed in the past two weeks. They supply agents with the insights needed to make informed decisions without interfering with the system's ongoing progress. Once a tool has completed its requested task, it hands control back to the agent that called it.
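
The difference between action nodes and tool nodes can be sketched in types: an action returns an updated state, while a tool only reads the state and returns data. The `State` shape, the `ActionNode` and `ToolNode` types, and both example nodes are assumptions made for illustration. The test runner is injected as a stand-in for real Playwright execution.

```typescript
// Hypothetical shapes for illustration.
interface RunRecord {
  passed: boolean;
  daysAgo: number;
}

interface State {
  testCode: string;
  lastRunPassed: boolean | null; // null until the test has been run
  runs: RunRecord[];
}

// Action node: takes the state and returns an updated state.
type ActionNode = (state: State) => State;

// Tool node: reads the state and returns data. Readonly<State> turns
// top-level writes into compile errors.
type ToolNode<T> = (state: Readonly<State>) => T;

// Build a "run test" action around whatever actually executes the test.
function makeRunTestAction(execute: (code: string) => boolean): ActionNode {
  return (state) => ({ ...state, lastRunPassed: execute(state.testCode) });
}

// Tool: count passes and failures over the past two weeks.
const recentRunCounts: ToolNode<{ passed: number; failed: number }> = (state) => {
  const recent = state.runs.filter((r) => r.daysAgo <= 14);
  const passed = recent.filter((r) => r.passed).length;
  return { passed, failed: recent.length - passed };
};

const before: State = {
  testCode: "await page.click('#date')",
  lastRunPassed: null,
  runs: [
    { passed: true, daysAgo: 1 },
    { passed: false, daysAgo: 3 },
    { passed: true, daysAgo: 20 },
  ],
};
const after = makeRunTestAction(() => false)(before); // stand-in runner that always fails
const counts = recentRunCounts(before);
```

Here the action produces a new state with `lastRunPassed` set, while the tool reports one pass and one failure in the two-week window and leaves the state as it was.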

Modifiers

Modifier nodes adjust the level of detail available to other nodes, akin to the zoom feature in Google Maps. These nodes don’t change the state itself but fine-tune how much information is visible, helping agents focus on the most critical aspects at any given moment. For example, in test maintenance, a modifier node might reveal the complete console errors or the full HTML of a webpage, allowing the agent to zero in on specific issues. By controlling the granularity of information, modifier nodes ensure that agents have the right level of detail to work effectively.
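
A modifier can be pictured as a change to view settings rather than to the data itself. In this sketch, `Detail`, `View`, `render`, and `showFullConsoleErrors` are hypothetical names chosen for the example.

```typescript
// Hypothetical types for illustration.
type Detail = "summary" | "full";

interface PageData {
  consoleErrors: string[];
  html: string;
}

interface View {
  consoleErrors: Detail;
  html: Detail;
}

// Render what an agent sees for the given view settings.
function render(data: PageData, view: View): string {
  const errors =
    view.consoleErrors === "full"
      ? data.consoleErrors.join("\n")
      : `${data.consoleErrors.length} console error(s)`;
  const html =
    view.html === "full" ? data.html : `HTML (${data.html.length} characters)`;
  return `${errors}\n${html}`;
}

// Modifier node: changes how much is shown, never the underlying data.
function showFullConsoleErrors(view: View): View {
  return { ...view, consoleErrors: "full" };
}

const pageData: PageData = {
  consoleErrors: ["TypeError: x is undefined", "404 /api/dates"],
  html: '<div id="picker"></div>',
};
const compactView: View = { consoleErrors: "summary", html: "summary" };
const compactText = render(pageData, compactView);
const detailedText = render(pageData, showFullConsoleErrors(compactView));
```

The compact rendering shows only a count of console errors, and the detailed one lists each error in full. The page data is identical in both cases.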

Control flows

Control flow nodes manage the sequence of operations and guide the flow of actions within the system. These nodes operate behind the scenes, indirectly influencing which agent is active or determining when a process is complete. In a kitchen analogy, this is like a chef coordinating the flow of tasks—whether returning to a previous step or deciding that a process is finished. In our system, control flow nodes are handy when an agent has completed its task but either needs to loop back or isn’t sure who exactly should take over next. They help maintain order and adaptability, ensuring the system can handle complex scenarios and transitions smoothly.

Policy function: Tying it all together

These nodes (agents, actions, tools, modifiers, and control flows) work together to handle the intricate tasks of test maintenance. While each node performs its specialized function, they collectively contribute to the system's ability to decide the next best action, guided by the policy function. The policy function maps all of the nodes and their relationships to each other to orchestrate the entire process, ensuring that the multi-agent system operates efficiently and effectively when maintaining end-to-end tests.
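
The sketch below shows one way such a map of nodes could be represented: each node has a kind and a list of nodes reachable from it, and a small loop asks a chooser which one to visit next. The graph contents, the node names other than the Date Picker Maintainer, and the `runGraph` function are all assumptions made for illustration. The scripted chooser stands in for the AI-driven policy.

```typescript
// Hypothetical node graph; not the actual system's configuration.
type NodeKind = "agent" | "action" | "tool" | "modifier" | "controlFlow";

interface GraphNode {
  kind: NodeKind;
  next: string[]; // names of the nodes reachable from this one
}

const graph: Record<string, GraphNode> = {
  datePickerMaintainer: {
    kind: "agent",
    next: ["runTest", "getRunHistory", "showFullHtml", "done"],
  },
  runTest: { kind: "action", next: ["datePickerMaintainer"] },
  getRunHistory: { kind: "tool", next: ["datePickerMaintainer"] },
  showFullHtml: { kind: "modifier", next: ["datePickerMaintainer"] },
  done: { kind: "controlFlow", next: [] },
};

type Chooser = (current: string, options: string[], visited: string[]) => string;

// Walk the graph until a node with no outgoing edges or the step limit.
function runGraph(start: string, choose: Chooser, maxSteps: number): string[] {
  const visited = [start];
  let current = start;
  while (graph[current].next.length > 0 && visited.length < maxSteps) {
    const options = graph[current].next;
    const choice = choose(current, options, visited);
    if (options.indexOf(choice) === -1) {
      throw new Error(`"${choice}" is not reachable from "${current}"`);
    }
    visited.push(choice);
    current = choice;
  }
  return visited;
}

// Scripted chooser standing in for the model's decisions.
const script = ["runTest", "datePickerMaintainer", "done"];
const visitedPath = runGraph(
  "datePickerMaintainer",
  (_current, _options, visited) => script[visited.length - 1],
  10
);
```

Restricting each choice to the current node's `next` list keeps the chooser from jumping to a node that isn't connected, whatever it returns.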

Guiding principles

Our guidelines emerged as a result of developing our multi-agent system. As we learned about our system, we honed these practical guiding principles to drive the development of our AI solutions. These principles are now the backbone of our development process.

Principle #1: Each agent should be an expert on a specific topic.

Specialization allows each agent to work quickly, accurately, and efficiently. When one agent finishes its task, it knows precisely which agent to hand off to for the next step in the process.

This specialization keeps each operation fast and cost-effective. Additionally, this approach allows us to scale the system quickly by adding new agents as needed. Because each agent is focused on a specific task, introducing new ones into the system doesn’t cause significant disruptions, making the process more resilient and adaptable.

Principle #2: Prioritize an evaluation benchmark to measure the end-to-end performance.

When testing our AI agents, it's essential to focus on how the entire system performs rather than just evaluating each agent independently. We aim to see how everything works together, including how the system handles challenges and recovers from mistakes.

While unit testing each agent can provide valuable insights, it doesn’t capture the whole picture. If a change in one agent slightly reduces its performance but leads to an overall improvement in the system, that's a win. The effectiveness of the entire system is what truly matters.

To keep things running well, we run over 75 scenarios every night, looking for any issues and ways to improve. This approach ensures that our AI system "thinks like a QA engineer," aligning its actions and decisions with those of real QA experts. By focusing on end-to-end results, we ensure our agents perform well individually and work together cohesively, much like how a restaurant manager ensures all parts of the operation come together during a busy shift.
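
A benchmark of this kind can be sketched as a list of scenarios, each of which reports whether the whole system succeeded, plus a pass rate computed across all of them. The `Scenario` and `BenchmarkResult` shapes, the function names, and the scenario names below are hypothetical. A real harness would run the full agent system instead of returning fixed values.

```typescript
// Hypothetical benchmark harness for illustration.
interface Scenario {
  name: string;
  run: () => boolean; // true if the whole system repaired the test
}

interface BenchmarkResult {
  passRate: number;
  failed: string[];
}

function runBenchmark(scenarios: Scenario[]): BenchmarkResult {
  const failed = scenarios.filter((s) => !s.run()).map((s) => s.name);
  const passRate =
    scenarios.length === 0
      ? 0
      : (scenarios.length - failed.length) / scenarios.length;
  return { passRate, failed };
}

// A change counts as a win when the end-to-end pass rate goes up,
// even if one agent looks worse when measured on its own.
function isSystemImprovement(
  before: BenchmarkResult,
  after: BenchmarkResult
): boolean {
  return after.passRate > before.passRate;
}

// Stand-in scenarios with fixed outcomes.
const nightly = runBenchmark([
  { name: "date picker selector changed", run: () => true },
  { name: "button renamed", run: () => true },
  { name: "flaky network", run: () => false },
  { name: "login redirect", run: () => true },
]);
```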

Principle #3: The agent's behavior should be driven by an AI system, not hard-coded logic.

This principle represents a significant shift from the old mindset, where logic is written in code that must be explicitly changed whenever we want to change behavior. We recognize that clinging to a rigid approach can quickly render systems outdated and ineffective, especially in the fast-evolving world of AI.

To keep our E2E tests effective, our agents must be flexible and adaptable. Rigid, unchanging logic doesn’t work for us. We achieve this by building on top of LLMs, which are continuously improved by third parties.

Our multi-agent framework acts as the runtime environment—think Node.js—and the agents are like the programming language. For example, when fixing a bug in your JavaScript code, you wouldn’t start by tweaking Node.js or Deno. Instead, you’d adjust the code, similar to how we fine-tune the behavior of our agents within the framework.
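
To illustrate the analogy, the sketch below separates generic framework code from an agent definition expressed as data. The `AgentDefinition` shape, the `Model` type, and `chooseNextNode` are assumptions made for this example, and the model is a plain function standing in for an LLM call. Under these assumptions, changing an agent's behavior means editing its instructions, not the framework function.

```typescript
// Hypothetical shapes; not the actual framework API.
interface AgentDefinition {
  name: string;
  instructions: string; // natural-language guidance given to the model
  canCall: string[]; // nodes this agent may hand off to
}

type Model = (prompt: string) => string; // stand-in for an LLM call

// Framework code ("the runtime"): generic, unchanged when behavior changes.
function chooseNextNode(
  agent: AgentDefinition,
  stateSummary: string,
  model: Model
): string {
  const prompt =
    `${agent.instructions}\n\nState:\n${stateSummary}\n\n` +
    `Reply with one of: ${agent.canCall.join(", ")}`;
  const reply = model(prompt).trim();
  if (agent.canCall.indexOf(reply) === -1) {
    throw new Error(`Model chose unknown node "${reply}"`);
  }
  return reply;
}

// Agent definition ("the program"): behavior changes are made here.
const datePickerMaintainer: AgentDefinition = {
  name: "Date Picker Maintainer",
  instructions: "You fix date-picker issues in tests, such as selector problems.",
  canCall: ["runTest", "getRunHistory", "done"],
};
```

A call such as `chooseNextNode(datePickerMaintainer, summary, model)` returns whichever permitted node the model names, and throws if the reply is not in `canCall`.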

This approach marks a fundamental change in how we build and maintain technology. It's about designing resilient systems capable of growth, enabling our agents to learn and improve over time without requiring a complete overhaul for every minor change.

Why a multi-agent system is better for test maintenance

Relying on just one agent for QA testing can lead to significant issues. When a single agent is responsible for everything, it easily gets overwhelmed, resulting in mistakes, higher costs, and slower progress as it struggles to manage too many tasks. Additionally, a single-agent system introduces a critical point of failure—if engineers need to adjust the agent's behavior, any misstep could cause the entire AI system to collapse. This vulnerability makes single-agent systems not only inefficient but also risky.

On the other hand, a multi-agent system splits the work among different agents, each focusing on one specific task. This might seem like it could slow things down, but by sticking to our guiding principles, we make the process faster, more accurate, and easier to manage. While using multiple agents can be tricky because you have to keep everything organized, the benefits—like fewer mistakes, lower costs, and quicker testing—are well worth it.

Ultimately, a multi-agent system is a much better and more reliable way to handle QA test maintenance than just one agent. Our goal is to keep our tests as accurate as possible, like how a restaurant aims to serve the perfect meal. We don’t just look at how one agent does its job; we focus on how all the agents work together as a team. This teamwork is what makes a difference. By making sure each agent handles its own tasks and makes intelligent decisions, our system stays efficient, and our tests remain accurate and reliable—just like a great kitchen team working together.
