Three traps to avoid when doing E2E PR testing

John Gluck
July 16, 2024

PR testing—all the cool kids are doing it. If you’ve configured your repo to run code analysis or unit tests before merging, congratulations, you’re already doing basic PR testing. But running your entire end-to-end suite, well, that’s on another level. The first thing you'll notice is flakes: a ton of them.

Flakes in tests often stem from testers working around a problem in the app that developers don't take ownership of. It may only be a problem because the app isn't quite testable enough or the environment is underprovisioned: something that isn't in QA's purview and isn't on the developers' priority list to fix. But then, when you start PR testing and a workaround that sporadically fails starts failing more loudly, you suddenly realize, "Oh, that's tech debt."

E2E PR testing exacerbates the already enormous burden on your team to investigate failures and maintain broken tests. Since PR testing runs your tests more frequently, flaky tests get noisier. If you’ve got 100 tests running once daily with a 15% flake rate, you’ll see 15 daily failures. With PR testing and five developers each making one PR daily, you’ll see 75 daily failures. Suddenly, the irritating hum of test maintenance becomes an inescapable cacophony.

With so much noise from unaddressed flakes, there’s a tendency to turn off automated tests and manually spot-check the PR just to get it out of the chute. But that noise indicates a deeper problem—previously hidden tech debt built into the test suite and running infrastructure. Most companies are surprised by the volume of test tech debt E2E PR testing exposes when the engineering org starts running those impacted tests on each PR.

Teams can fall into three traps after implementing an E2E PR testing strategy. These traps result from the team’s response to the problems caused by a suboptimal PR testing implementation, and paradoxically, they blunt or erase the gains that teams hope to see by implementing E2E PR testing.

Trap 1: Falling back to manual testing

Test data collisions happen when two separate tests simultaneously modify the same system’s data. Since most test data collisions result from race conditions, they often manifest as flakes. In theory, collisions are avoidable when the team designs the test suite so data is completely isolated from other concurrently running tests. However, test scripts written before a team moved to PR testing usually don’t prioritize data isolation since concurrency wasn’t a requirement when they were created.

Flakes prevent developers from merging code, and no one wants to hold back a release because of a test problem. After all, the purpose of PR testing is to test earlier so you can ship faster. But as flakes pile up, your team spends more time investigating failures than working on new features.

So, when testing teams are under the gun to deliver, the fastest way to resolve test data collisions is to turn off the colliding tests and execute them manually. But, besides being slower, manual testing increases the risk that a bug slips through and escapes to production.

To ship faster, you need to test faster, which means decreasing overall test execution time and reducing the frequency of your test flakes and test data collisions. To prevent collisions, you need to train testers to create the unique test data needed for a given case and tear it down. Your team will need to refactor much of the existing code to accomplish this, so much so that a complete rewrite might be in order. Furthermore, your team may have to add a testability feature to the app, which adds more work for developers, thus slowing down delivery. So there’s a conflict between delivering in the short term and going faster in the long term.

We’ve written an article with a section specific to writing tests to avoid test data collisions. While the best method is to create unique data per test per run, there are other ways if you work on an internal team and have some control and influence over your application and environment.
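
For illustration, here's a minimal sketch of the unique-data-per-test pattern in a Playwright-style TypeScript test. The framework choice, seeding endpoint, and selectors are all assumptions for the sake of the example; the point is that each test creates its own data and tears it down, so concurrent PR runs can't collide:

```typescript
import { test, expect } from '@playwright/test';
import { randomUUID } from 'crypto';

// Each test creates its own uniquely named record, so tests running
// concurrently on other PRs can never collide on shared data.
test('user can rename a project', async ({ page, request }) => {
  const projectName = `pr-test-project-${randomUUID()}`; // unique per test, per run

  // Hypothetical seeding endpoint: your app might expose a test-only API,
  // a CLI, or database fixtures instead. Assumes baseURL is configured.
  const created = await request.post('/api/test-data/projects', {
    data: { name: projectName },
  });
  const { id } = await created.json();

  try {
    await page.goto(`/projects/${id}`);
    await page.getByRole('button', { name: 'Rename' }).click();
    await page.getByLabel('Project name').fill(`${projectName}-renamed`);
    await page.getByRole('button', { name: 'Save' }).click();
    await expect(page.getByText(`${projectName}-renamed`)).toBeVisible();
  } finally {
    // Tear down only the data this test created, and nothing else.
    await request.delete(`/api/test-data/projects/${id}`);
  }
});
```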

Trap 2: Only running a few select tests

As we said, teams are often surprised by how many tests start flaking once they run on every PR. PR testing exposes data isolation and concurrency weaknesses in the test code, causing flakes that block the pipeline and noise that increases the risk of escapes.

One solution teams attempt is to curate their test suite into smaller, purpose-driven sets (smoke tests, sanity tests, and so on) and run only a selection of healthy or essential tests on PRs. This strategy makes sense as an early step: it spares the team from immediately addressing the tech debt that surfaces when they run all the tests before merging. After all, running some tests before merging is better than running none.

But this approach creates a false sense of confidence. Bugs that the excluded tests would have caught sneak into the shared testing environments, because the tests that would have stopped the merge never ran. Meanwhile, the flaky tests that inspired the curation in the first place are still running in those environments, creating noise and making it hard to tell whether a failure is a new bug or just the same old false positive. When that happens, real bugs get mislabeled as false positives, released to production, and found by customers. Production escapes are one of the most common side effects of test noise in shared environments.

Deferred maintenance of automated tests is a commonly overlooked and significant source of tech debt. Most teams don't have the resources to maintain and update tests while also shipping features. QA Wolf solves that problem by maintaining your tests for you, so your team can focus on delivering high-quality features to your customers.

Trap 3: Running tests sequentially or in batches

Test node contention occurs when multiple tests run simultaneously, but the infrastructure (nodes) they run on is limited. It’s like a traffic jam. If you want to fit more cars on the freeway, you need to widen the road. Likewise, if you want to run more tests, you need to beef up your parallel execution infrastructure; otherwise, the tests will start clogging (queueing) the onramp.

A traffic jam isn't a problem if the road you're driving on doesn't have one. Similarly, node contention doesn't cause problems before you implement PR testing because it has no immediate impact on developer productivity. But once a merge hinges on the test results, and those tests wait hours just to get an available node, contention becomes a big problem. And if multiple teams release simultaneously, you get a virtual rush hour, with teams stuck in traffic as they compete for the available execution nodes.

Full parallelization is the antidote to test node contention and its side effect: babysitting builds, the time your team wastes waiting for tests to pass. At QA Wolf, we get around the problem by running all tests in parallel. It’s like having a separate lane for every car on the road, so some of our customers’ test suites take only as long to complete as their longest test. We think fully parallel testing is the way to go and that you should do it every time, whether or not you’re using QA Wolf (we hope you will). We know it can be challenging to implement, so we’ve written some articles (here and here) that should help if you plan to go it alone.
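
If you do go it alone, most of the knobs live in your test runner and your CI. As a rough sketch (assuming Playwright; the worker and shard settings are illustrative, not prescriptive), a config that leans into parallelism might look like this:

```typescript
// playwright.config.ts: illustrative settings for maximizing parallelism.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run test files (and the tests within them) in parallel rather than serially.
  fullyParallel: true,

  // Use every available core on the CI node. To spread the suite across many
  // nodes instead of queueing on one, combine this with CI-level sharding,
  // e.g. `npx playwright test --shard=1/10` on each of ten machines.
  workers: process.env.CI ? '100%' : undefined,

  use: {
    // Assumes CI injects a per-PR URL for the build under test.
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
  },
});
```

In-process workers only widen the road as far as one machine's cores; sharding across CI nodes is what actually adds lanes, and it only stays reliable if the data isolation work from Trap 1 is done first.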

PR testing support at QA Wolf

We built our PR testing support so our customers could continue shifting left in pursuit of pain-free releases and continuous delivery. Your team deploys to a temporary environment (the preferred approach) or to one of several static environments reserved specifically for PR testing. We give you a set number of environment slots, which represent the number of concurrent PRs we’ll handle for you; any PRs beyond that get queued. It’s that simple.

When done right, PR testing can stabilize shared testing environments by preventing buggy builds from getting deployed in the first place. Furthermore, we write and maintain your tests with our zero-flake guarantee. Your organization will have a noise-free signal and a better idea of the health of any given release candidate. With a clear signal, your team can move features to production more quickly.
