Innovations in evaluating AI agent performance

Yurij Mikhalevich
Jon Perl
September 26, 2024

As we develop our AI agents, it’s important to understand their performance, specifically how accurately they solve a given problem. Each time an agent interacts with a large language model (LLM), its response varies based on the specific situation and past interactions. That variability makes it hard to gather consistent data, which in turn makes performance difficult to predict or measure reliably.

Traditional AI metrics focus on individual actions. They’re like a coach who only watches an athlete during one drill: they miss the full story of progress and improvement. Those single-focus measurements are useful, but they don’t show the bigger picture. To truly understand how our agents evolve, we need more than isolated snapshots; we need to see how consistently they perform over time and on the tasks that matter most.

To fill this gap, we’ve developed new evaluation methods that measure agent performance over time, both in controlled environments and in real-world scenarios. By comparing how agents perform in training with their real-world results, we uncover the strengths of our approach and the key areas that need improvement. It’s like checking whether the training really pays off where it matters most, making sure our agents are ready to excel at real-world challenges.

The complexity of agent evaluation

At QA Wolf, we create and maintain automated tests for our customers, and we’ve trained AI to manage these tasks. To ensure reliable performance, we design evaluation scenarios that test how well agents can create and maintain tests.

We prepare our agents (athletes) for real-world tasks (competitions) by simulating real-world conditions, observing them as they execute plays (scenarios), and tracking their progress. If an agent performs well in a training scenario but struggles when it encounters that same scenario in the real world, we know we need to adjust our training methods.

One of the biggest challenges in measuring agent performance is the complexity of real-world scenarios. Agents aren’t just handling one task at a time; they’re juggling multiple interconnected tasks. These tasks can have conflicting goals, unpredictable data, and sudden changes from other agents, creating a constantly shifting environment. Because of this, it can be hard to measure performance consistently, since even small changes can ripple through the system and affect the overall outcome. To pinpoint what’s causing a performance issue, we need to look at the bigger picture and understand how all the moving parts interact in real-world conditions (i.e., in production).

Building a comprehensive test automation evaluation framework

Traditional evaluation methods, which may focus on isolated actions or stable tasks, struggle to capture the intricate dynamics and adaptability required in such environments. To address the limitations of traditional metrics, we built a new framework that evaluates session performance while agents create and maintain automated tests. At the center of this framework are our "gym scenarios."

How we measure session performance

Our gym scenarios are inspired by tools like OpenAI Gym, which is widely used in AI research. Just as OpenAI Gym provides environments for reinforcement learning models, our gym scenarios create controlled conditions that simulate real-world test creation and maintenance challenges.
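
To make the idea concrete, here is a minimal sketch of what a gym scenario interface could look like. The class and method names are illustrative assumptions modeled on the reset()/step() pattern popularized by OpenAI Gym, not a description of our production code.

```python
# Illustrative sketch only: a gym scenario as a reset()/step() environment.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str  # what the agent sees after acting (e.g., page state, test output)
    done: bool        # whether the scenario has reached a terminal state
    success: bool     # whether the agent's work meets the scenario's goal


class GymScenario(ABC):
    """A controlled, repeatable simulation of a test creation or maintenance task."""

    @abstractmethod
    def reset(self) -> str:
        """Set up the scenario and return the initial observation for the agent."""

    @abstractmethod
    def step(self, action: str) -> StepResult:
        """Apply the agent's action (e.g., an edit to a test) and report the outcome."""
```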

The gym health metric: our weighted scoring system for accurate evaluation

To accurately evaluate our agents, we implemented a weighted scoring system. Much like a teacher or professor weighting assignments in a class, each gym scenario is assigned a weight, so more critical tasks contribute more heavily to the overall evaluation than less important ones.

This innovative weighted approach allows us to rank each gym scenario by importance. A flat system, where all scenarios are treated equally, wouldn’t account for the varying significance of tasks. Our weighted model better aligns our scenarios with agent success, and the system is dynamic, meaning the weights can be adjusted over time as our agents evolve and encounter new challenges. This ensures the evaluation remains accurate and relevant, reflecting real-world priorities.
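
As a simplified illustration, the gym health metric can be thought of as a weighted average of per-scenario pass rates. The scenario names, weights, and pass rates below are invented for the example.

```python
# A minimal sketch of a weighted "gym health" score with made-up data.
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str         # gym scenario identifier
    weight: float     # relative importance (higher = more critical)
    pass_rate: float  # fraction of runs the agent handled correctly, 0.0-1.0


def gym_health(results: list[ScenarioResult]) -> float:
    """Weighted average of scenario pass rates, in the range 0.0-1.0."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.weight * r.pass_rate for r in results) / total_weight


results = [
    ScenarioResult("create-login-test", weight=5.0, pass_rate=0.92),
    ScenarioResult("fix-flaky-selector", weight=3.0, pass_rate=0.80),
    ScenarioResult("update-copy-assertion", weight=1.0, pass_rate=0.99),
]
print(f"gym health: {gym_health(results):.2f}")
```

Because the score is normalized by total weight, a regression on a high-weight scenario drags gym health down far more than the same regression on a rarely encountered one.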

How we maintain real-world relevance with dynamic weighting

As we continually make changes to our agents, our goal is to ensure that the performance scores from our gym scenarios stay aligned with how agents perform in real applications. If we misjudge the importance of tasks, giving too much weight to minor ones or not enough to critical ones, our evaluations would be skewed, leading us to focus on the wrong areas for improvement. This would leave our agents struggling to perform effectively in the real world.

To address this, we developed another innovation: automatically classifying real-world sessions into their matching gym scenarios. By breaking down real-world events into smaller tasks, we can track how often each gym scenario occurs in the real world and continuously update our weights based on that data. This creates a feedback loop, where actual agent performance guides and adjusts our scenario weights to stay accurate.
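
In simplified form, that feedback loop might look like the sketch below, where classify_session stands in for the classification step described above and the smoothing term is an illustrative detail that keeps rarely seen scenarios from collapsing to zero weight.

```python
# A simplified sketch of the weight feedback loop. classify_session is a
# hypothetical stand-in for the step that maps a production session to the
# gym scenario it most closely resembles.
from collections import Counter
from typing import Callable, Iterable


def reweight_scenarios(
    sessions: Iterable[dict],
    classify_session: Callable[[dict], str],
    smoothing: float = 1.0,
) -> dict[str, float]:
    """Derive scenario weights from how often each scenario occurs in production."""
    counts = Counter(classify_session(session) for session in sessions)
    total = sum(counts.values()) + smoothing * len(counts)
    # Each scenario's weight is its (smoothed) share of real-world sessions,
    # so the weights sum to 1.0 across observed scenarios.
    return {
        scenario: (count + smoothing) / total
        for scenario, count in counts.items()
    }
```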

How we identify missing scenarios and coverage gaps

We also track any scenarios our sessions fail to handle effectively. Each time a session fails, we add to a "missing scenario" counter, which helps us identify gaps in our scenario coverage. This process ensures that we don’t miss critical real-world tasks and that our sessions can handle an increasingly broader range of situations.
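
A minimal version of that bookkeeping might look like the sketch below, which assumes failed sessions are grouped by a short summary; summaries that accumulate high counts point to scenarios worth adding.

```python
# Illustrative only: tally failed sessions that no existing gym scenario
# explains, so recurring gaps in coverage surface as candidates for new scenarios.
from collections import Counter

missing_scenario_counter: Counter[str] = Counter()


def record_unmatched_session(failure_summary: str) -> None:
    """Call this when a failed session can't be matched to any gym scenario."""
    missing_scenario_counter[failure_summary] += 1
```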

How we improve training efficiency with key scenario focus

One challenge in training AI agents is the need to run many different evaluations to catch performance issues. However, running every possible test can be time-consuming and expensive. Our weighting system helps solve this problem by allowing us to focus on the most important scenarios—the ones that represent the most common real-world tasks.

Instead of running every test, we only need to run the key scenarios, roughly 5% of the total. These high-weight scenarios are the ones most likely to reveal performance issues, and by focusing on them we can still catch 95% of the problems. This saves significant time and cost while ensuring our sessions achieve near-optimal performance in real-world situations.
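
One simple way to pick that subset is a greedy selection over the scenario weights, sketched below under the assumption that the weights reflect real-world frequency.

```python
# A sketch of selecting a small, high-weight subset of scenarios, assuming
# weights reflect how often each scenario occurs in production.
def select_key_scenarios(
    weights: dict[str, float], coverage: float = 0.95
) -> list[str]:
    """Return the highest-weight scenarios that together cover `coverage` of total weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    total = sum(weights.values())
    selected: list[str] = []
    covered = 0.0
    for name, weight in ranked:
        if covered >= coverage * total:
            break
        selected.append(name)
        covered += weight
    return selected
```

When a handful of scenarios account for most real-world sessions, this greedy cut keeps evaluation runs short without sacrificing much coverage.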

Refining benchmarks with real-world data for dynamic evaluations

In AI, improving performance often starts with retraining the model to fine-tune its internal logic for better task handling. However, this creates a cyclical challenge: to measure performance accurately, we need consistency from our agents. But achieving that consistency is difficult without first having accurate measurements. To break this cycle, we need to look at our benchmarks.

Traditionally, benchmarks are set arbitrarily, relying on controlled environments that don’t reflect the complexity of real-world tasks. We take a different approach. Rather than constantly retraining the model, we focus on refining our benchmarks—the standards by which we measure success. By adjusting the weights of our gym scenarios, we ensure our evaluation framework aligns with the real-world challenges our agents encounter.

Our benchmarks are based on real-world data, evolving as our agents face more complex and unpredictable situations in production. This dynamic approach ensures that our evaluations remain relevant and reflect real-world performance, allowing us to continuously and confidently improve.

Paving the way for smarter, real-world test automation

We’ve introduced several changes to improve how we evaluate our sessions — much like how a trainer fine-tunes an athlete’s plan to ensure they’re ready for competition. From weighted gym scenarios to dynamically adjusted metrics, we’re changing how we measure success by focusing on our agents' challenges in the real world.
