Three questions to ask about AI-powered testing tools and services

Kirk Nathanson
March 27, 2024
I'd say the various no-code AI tools were very easy to set up but very difficult to scale. It always felt like the no-code AI tools just made things more complicated and unpredictable. The premise is interesting, but they slowed you down about as much as they sped you up.
Philip Seifi, co-founder @ Colabra

Regression testing is a repetitive, ceaseless operation — Does that work? Yeah. Does that work? Yeah. Does that work? Yeah — that’s why so much effort is spent automating it. Robots are great at repetitive tasks. But just running automated tests is a very, very small part of what goes into QA automation.

The real effort is investigating test failures after each and every run. That means manually reproducing the test to confirm there’s a bug, re-running the test if it flaked, or fixing the test if the latest release changed the expected behavior.

Two-thirds of companies test less than 50% of their user workflows because it takes one test automation engineer for every four front-end developers to keep up with the investigation and maintenance demands of a test suite. That’s why AI-powered testing tools that promise “natural language” test creation and “self-healing” are so appealing: if you could get faster and cheaper test creation and maintenance, you could build and maintain a more extensive test suite and test more often.

While most (if not all) AI-powered testing tools use an off-the-shelf model from OpenAI or Anthropic, the particulars of how they’re implemented make a huge difference in their ability to increase test coverage, expedite QA cycles, and make humans more effective.

Here are three big questions you should ask about the AI technology powering QA solutions.

Is it actually “self-healing” or just self-modifying?

There are scads of tools out there that claim to “self-heal” and “auto-heal.” That sounds pretty cool, and it can actually be helpful if implemented properly. So it’s important to understand what a proper implementation looks like.

Broken selectors are probably the most common situation that a “self-healing” AI will encounter. A selector is an element property that automated tests use to identify something on the page, such as a button, a link, or a block of text. Selectors change frequently as developers tweak the UI or underlying code (Read: “What makes a good selector”), causing errant test failures that have to be resolved.
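
To make that concrete, here’s a minimal, hypothetical Playwright snippet showing two common ways a test might locate the same element (the URL, class name, and button label are invented for illustration):

```typescript
import { test, expect } from '@playwright/test';

test('open the settings page', async ({ page }) => {
  await page.goto('https://app.example.com'); // hypothetical URL

  // One option: a CSS selector tied to markup structure.
  // It breaks if a developer renames the class, even though nothing is actually wrong.
  // await page.locator('button.settings-icon').click();

  // Another option: a role/name-based locator, which survives cosmetic refactors.
  await page.getByRole('button', { name: 'Settings' }).click();

  await expect(page).toHaveURL(/settings/);
});
```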

Updating a selector is a fairly trivial job for a human, what with their judgment and intuition and other human niceties. But self-healing AI has none of that, so it has to fall back on a fairly basic decision-making process: “Only a true bug could stop a test from passing. So if the test can pass, then ipso facto, there can’t be a bug.” To the AI, the logic is sound. All it needs to do is fix what’s wrong with the test, and the errant failure will be resolved. However, a genuinely valuable self-healing AI must understand intent, not just outcome.

Let’s look at examples that show why this distinction is critical.

Example 1:

Your app has a built-in messaging system. To view your messages, you have to click a button that dynamically shows the number of unread messages, like “Inbox (3)”. When the AI writes a test to open the inbox, it uses “Inbox (3)” as a selector and doesn’t know that the message count can change. On the next run, the UI shows “Inbox (7),” which breaks the test, so the AI replaces the selector with “Inbox (7),” and the test passes.

This approach works, albeit inefficiently. The AI can make this change to the test on every run, forever.

But the approach has its downsides: if you modify a failing test just so it passes, you could change the intention of the test and let a bug through. And that’s why companies need to be cautious about the generative AI they use for QA.
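
To make Example 1 concrete, here’s a hypothetical Playwright version of that inbox test (the URL and headings are invented). The brittle version encodes one specific unread count; the intent-preserving version matches the label pattern instead, which is the kind of fix a human, or a genuinely intent-aware AI, would make:

```typescript
import { test, expect } from '@playwright/test';

test('open the inbox', async ({ page }) => {
  await page.goto('https://app.example.com'); // hypothetical URL

  // Brittle: breaks as soon as the unread count changes from 3 to anything else.
  // await page.getByRole('button', { name: 'Inbox (3)' }).click();

  // Intent-preserving: match the label pattern, not one specific count.
  await page.getByRole('button', { name: /^Inbox \(\d+\)$/ }).click();

  await expect(page.getByRole('heading', { name: 'Inbox' })).toBeVisible();
});
```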

Example 2:

A developer accidentally changes the “log in” button to read “logggin.” The test that checks the log-in flow fails because it can’t find the old (correct) selector, so the AI auto-heals by changing the selector to “logggin.” Now the locator is valid, and the test resumes and passes. And if you’re practicing continuous delivery, you have a glaring typo on your production site.

The worst part is not the embarrassment of the typo; it’s that your developers no longer trust that the AI is accurately testing their releases — so they go back to manually testing each PR.
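
Here’s a sketch of what an intent-preserving version of that log-in test might look like in Playwright (the form labels, URL, and credentials are invented). Because the expected button label is asserted explicitly, the typo fails the test loudly instead of being quietly “healed” away:

```typescript
import { test, expect } from '@playwright/test';

test('user can log in', async ({ page }) => {
  await page.goto('https://app.example.com/login'); // hypothetical URL

  // Asserting the expected label makes the intent explicit. If the button now
  // reads "logggin", this fails and a human sees the typo; the test should not
  // be "healed" by swapping in the misspelled text.
  const loginButton = page.getByRole('button', { name: 'Log in' });
  await expect(loginButton).toBeVisible();

  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('hunter2');
  await loginButton.click();

  await expect(page).toHaveURL(/dashboard/);
});
```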

The AI was so preoccupied with what could pass that it never considered what really should.

For AI to genuinely "self-heal," it needs more than just the ability to change a failing test to a passing one—it needs the judgment (something AI is distinctly bad at) to determine whether the change aligns with the original purpose of the test. True healing involves an adaptive correction process, where the AI flags inconsistencies for human review after making adjustments. This added layer of oversight acknowledges that AI lacks the nuanced understanding needed to verify if a fix truly "heals" the issue or simply modifies it superficially.


Without adaptive correction, the AI might push changes that technically resolve a test failure but miss the intent, introducing unintended behaviors or errors into production. By using human review, the system can catch these issues—like a misinterpreted rule or accidental typo—before they affect users. True self-healing blends automation with judgment so the test aligns with its original intent, not just a passing status.
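
The sketch below shows what that human-in-the-loop gate might look like in code. It is purely illustrative; the types and function names are hypothetical, not any vendor’s actual API:

```typescript
// Illustrative sketch of an "adaptive correction" gate: the AI may propose a
// fix, but nothing is applied until a human confirms it matches the test's intent.
// All names here are hypothetical.
interface ProposedFix {
  testName: string;
  failedSelector: string;
  suggestedSelector: string;
  aiRationale: string;
  status: 'pending-review' | 'approved' | 'rejected';
}

function proposeFix(
  testName: string,
  failedSelector: string,
  suggestedSelector: string,
  aiRationale: string
): ProposedFix {
  // The AI never applies its own change; it only records a suggestion.
  return { testName, failedSelector, suggestedSelector, aiRationale, status: 'pending-review' };
}

function reviewFix(fix: ProposedFix, matchesIntent: boolean): ProposedFix {
  // A human decides whether the change preserves what the test was meant to verify.
  return { ...fix, status: matchesIntent ? 'approved' : 'rejected' };
}

// Example: the "logggin" case from above should be rejected, not auto-applied.
const fix = proposeFix(
  'log-in flow',
  'text="Log in"',
  'text="logggin"',
  'Selector no longer found; nearest match substituted'
);
const decision = reviewFix(fix, /* matchesIntent */ false);
console.log(decision.status); // "rejected"
```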

How does the AI solution manage the vast amounts of data?

Every AI system has a limit on the amount of information and history it can use in its decision-making, along with techniques for handling the data that doesn’t fit. One such technique, called “pruning” (or “selective filtering”), involves the AI strategically ignoring certain details to streamline decision-making.

Imagine an e-commerce checkout process where a “Shipping Method” dropdown menu typically defaults to the standard shipping option. A recent update changes this default to express shipping, which incurs an additional fee. However, the AI sees the dropdown as a functional element with multiple valid options and doesn’t register this change as significant, so it doesn’t flag the default update. The test passes because all options are technically functional, but a human would catch the unintended cost impact on users who don’t actively choose a shipping method.
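
One way a test written with intent in mind would catch that regression is by pinning the default explicitly; the labels, URL, and option value below are hypothetical:

```typescript
import { test, expect } from '@playwright/test';

test('checkout defaults to standard shipping', async ({ page }) => {
  await page.goto('https://shop.example.com/checkout'); // hypothetical URL

  // Pin the default explicitly. If a release silently flips the default to
  // express shipping (and its extra fee), this assertion fails instead of
  // being shrugged off as "the dropdown still works".
  const shipping = page.getByLabel('Shipping Method');
  await expect(shipping).toHaveValue('standard');
});
```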

QA Wolf uses an alternative approach called “summarization and targeted retrieval.” This method retains recent history—often spanning a set period, such as two weeks—and provides summaries of relevant changes to reduce noise. The system accesses these summaries to identify key trends and behaviors without combing through extensive logs. When deeper insights are required, the system allows targeted queries into specific events within the stored timeframe, ensuring all relevant context is available on demand.

This approach ensures that critical information is never lost, even across multiple test cycles. For example, if a button’s behavior changes unexpectedly, the system might summarize all UI changes during testing, helping identify relevant updates quickly. Rather than pruning data to fit into fixed token limits, it preserves important information for later review.

This strategy balances efficiency and depth, reducing the risk of missing critical details while avoiding the high costs associated with large memory models. It offers predictability and reliability by giving the system access to essential context only when it most needs it.
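
As a rough illustration of the general idea (a toy sketch, not QA Wolf’s actual system), a summarize-then-query store might look something like this, with all names and types invented:

```typescript
// Conceptual sketch of "summarization and targeted retrieval" only.
interface ChangeEvent {
  timestamp: Date;
  area: string;        // e.g., "checkout", "inbox"
  description: string; // e.g., "Shipping Method default changed to express"
}

class TestHistory {
  private events: ChangeEvent[] = [];

  record(event: ChangeEvent): void {
    this.events.push(event);
  }

  // Summaries keep recent history compact instead of pruning it away.
  summarize(days = 14): string[] {
    const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
    const recent = this.events.filter((e) => e.timestamp.getTime() >= cutoff);
    const byArea = new Map<string, number>();
    for (const e of recent) byArea.set(e.area, (byArea.get(e.area) ?? 0) + 1);
    return Array.from(byArea.entries()).map(
      ([area, count]) => `${area}: ${count} change(s) in the last ${days} days`
    );
  }

  // Targeted retrieval: pull full detail only when a specific question comes up.
  query(area: string, days = 14): ChangeEvent[] {
    const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
    return this.events.filter((e) => e.area === area && e.timestamp.getTime() >= cutoff);
  }
}
```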

Does the AI generate portable (and readable) Playwright code?

From time to time, you may have to debug an AI-generated test. (Crazy, we know!) You may even wish to take your test code from one solution and use it somewhere else as your needs change and the technology evolves. But many generative AI testing tools work by wrapping vanilla Playwright (or Cypress) in an assistive abstraction layer (i.e., a UI) that hides the underlying code so you can’t modify it. And that creates a potential for portability problems. There are exceptions, and some tools let you export the underlying code to Playwright or Cypress, but that doesn’t mean you’ll be able to run or even understand the exported tests out of the box. Any code that should be shared will instead be written out into each test, so it’s on your team to refactor it to adhere to the DRY (don’t repeat yourself) principle.

These AIs often produce JavaScript that’s optimized only for the tool’s own needs. This code is typically complex, inefficient, and lacks the structure or clarity needed for effective maintenance, as it’s not generated with human readability in mind. It might even include dependencies on hidden, internal services you’d have to recreate just to make the test work. These AI-driven tests lock you into a single platform. If you ever want to change your testing framework, you’ll likely find the code unusable, which means you have to rebuild your test suite mostly or entirely from scratch.

When evaluating AI-driven testing systems, look for those that prioritize collaboration and portability, where the generated code is purposefully readable and functional. The AI should create tests using well-established patterns, such as AAA (Arrange-Act-Assert), so they are easy to read, understand, and maintain. Clear, structured code fosters teamwork by keeping tests readable for the technical members of your team. Additionally, a portable system allows tests to be transferred or adapted across environments and platforms without a complete rewrite. These qualities keep tests functional for the long term, support continuous delivery, and reduce dependency on a specific tool or vendor.
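
For reference, here’s what a plain, portable Playwright test in the Arrange-Act-Assert style might look like (the product names and URL are hypothetical). There are no proprietary wrappers or hidden services, so it runs anywhere Playwright runs:

```typescript
import { test, expect } from '@playwright/test';

test('shopper can add an item to the cart', async ({ page }) => {
  // Arrange: get the app into a known starting state
  await page.goto('https://shop.example.com'); // hypothetical URL
  await page.getByPlaceholder('Search products').fill('coffee mug');
  await page.getByRole('button', { name: 'Search' }).click();

  // Act: perform the behavior under test
  await page.getByRole('link', { name: 'Classic Coffee Mug' }).click();
  await page.getByRole('button', { name: 'Add to cart' }).click();

  // Assert: verify the outcome the user cares about
  await expect(page.getByRole('link', { name: 'Cart' })).toContainText('1');
});
```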

Self-healing tests don’t eliminate the need for human investigation (yet) because the AI can’t be completely trusted (yet)

At the top of this piece, we pointed out that the most expensive part of automated testing is investigating failures and updating tests when the UI changes — actions that seem like they can be delegated to generative AI. But let’s really acknowledge what we’re asking generative AI to do for us because it’s not simply fixing selectors. In fact, we’re entrusting AI with business-critical decisions about the state of our software that could have huge repercussions on our users and our company. And the problem is that AI can’t yet fully be trusted to act independently. That time is coming — probably… maybe… eventually — but if you are banking the future of your QA processes on AI, you have to find tools that truly add value to your testing process.

A genuine self-healing system does more than convert failures to passes—it aligns changes with the original test intent. Effective AI also manages data in a way that reduces noise, making it easier to identify flaky tests and recurring issues without overwhelming your team. Given the rapid rate of innovation in this field, code portability is crucial; the AI should produce readable, flexible open-source (e.g., Playwright) code to avoid platform lock-in and ensure tests remain adaptable.

By identifying tools that fully support your goals, you can avoid common pitfalls and enhance your testing process so that it actually benefits from automation rather than creating a hidden liability.
