Unless your team really loves to live on the edge, chances are that you use a series of environments from local previews up to Production. The goal of these environments is not only to work out different kinds of problems in the code but also to test as you build.
In the higher environments, like Staging and Production, the differences between versions of the SUT (System Under Test) may seem minor or even nonexistent. That might lead you to conclude that an end-to-end test built for one environment would transfer over perfectly to another. But, in fact, those seemingly minor differences in the operation, behavior, and performance of a given build on two separate environments dramatically change the likelihood that a test fails.
Our experience tells us that, when building an end-to-end test suite for your product, we either have to focus on a single environment for regression testing or build unique tests for each environment in your pipeline. That’s not just because the environments might be different; very often, they are required to be different.
Before we dig in, here’s a quick refresher on common environments. If you have two or more environments, and you probably do, chances are you promote any given changeset through one or more of these environments. Note: your organization may not have all of these environments, or it may have more, or your environments may go by different names.
The development environment is typically the first place an application gets deployed after local development, warts and all. You’ve probably got unit tests, and maybe some integration tests, that run here automatically after you merge. But you are less likely to be running full end-to-end regression suites in the dev environment yet, since many tests would fail or have to be disabled for the build to be promoted.
Testing (sometimes called the QA environment) is usually the next environment up, but it’s not uncommon for teams to combine Development and Testing, or Testing and Staging. Either way, the configuration of the environment (and the feature itself) is still a long way off from Production. This environment, where it exists, should be a safe place for testers to work out kinks in their automation. But the primary reason for this environment is to shield testers from the kinds of disruptions they would face if they were running tests in Development.
By now the feature is more or less final. The Staging environment is supposed to mimic Production (“supposed to” being the operative phrase). This is where your application’s new features get tested before going to Production. You shouldn’t be discovering new defects here, but it’s rarely (if ever) the case that Staging behaves or performs exactly like its Production counterpart, so the end-to-end tests running here may need further modification to fit this environment’s idiosyncrasies, or may even need to be disabled entirely.
These days, it’s considered an advanced practice for companies to do full regression testing in Production, but there are many advantages to doing so, if for no other reason than the fact that lower environments aren’t perfect replicas. There’s a trade-off to running end-to-end tests here: on the one hand, this is where bugs really matter; on the other hand, it would be better if bugs never got here in the first place.
The primary purpose of the Testing environment is testability, so applications running in this environment are more likely to be configured for testing than in higher environments. For example, the Testing environment may have bypasses for such features as multi-factor authentication, single sign-on, and even login pages. Some features might be disabled for practicality, such as scheduling or batch processing. Eventual consistency might come with a slower guarantee or, in some cases, none at all.
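To make that concrete, here is a minimal sketch of an environment-aware login step, written with Playwright purely for illustration (the article doesn’t prescribe a framework). The TEST_ENV variable, the /auth/bypass route, and the credential variables are hypothetical stand-ins for whatever your own environments expose.

```ts
// login-helper.ts: a hypothetical environment-aware login step.
// Assumes baseURL is set in playwright.config.ts; all env vars here are illustrative.
import type { Page } from '@playwright/test';

export async function logIn(page: Page): Promise<void> {
  if (process.env.TEST_ENV === 'testing') {
    // The Testing environment exposes a bypass that skips MFA and SSO entirely (hypothetical route).
    await page.goto(`/auth/bypass?token=${process.env.AUTH_BYPASS_TOKEN}`);
    return;
  }

  // Higher environments require the real login flow.
  await page.goto('/login');
  await page.getByLabel('Email').fill(process.env.TEST_USER_EMAIL ?? '');
  await page.getByLabel('Password').fill(process.env.TEST_USER_PASSWORD ?? '');
  await page.getByRole('button', { name: 'Sign in' }).click();
  // MFA handling for Staging or Production would go here.
}
```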
Staging, on the other hand, is meant to simulate Production, but it is rarely an exact replica. The similarity to Production makes Staging the ideal place to simulate errors in our applications, to ensure that they are robust and that our data is recoverable. Tests that would be risky to perform in Production also run here, for instance tests that cause transactions to hit the company’s accounting systems. Whenever a tester simulates an error condition here, there is a chance that the error could impact other tests from the moment of the simulation and possibly beyond. For this reason, disruptions on Staging should be (and usually are) infrequent and well coordinated.
In Production, testers typically run “smoke tests” and other activities that can be performed safely without impacting critical systems. Production testing has to be done with great care to ensure that systems handle test data and traffic correctly and securely, and end-to-end tests may need to be further modified or disabled so that they don’t adversely impact the customer experience or day-to-day operations. Because of these constraints, writing automated tests that run in Production can be more complicated than for other environments.
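One common way to keep Production runs limited to safe activity is to tag tests and filter by tag per environment. The sketch below shows one way to do that with Playwright’s grep option; the TEST_ENV variable and the @smoke tag convention are assumptions for illustration, not a prescription.

```ts
// playwright.config.ts: a sketch that limits Production runs to tests tagged @smoke.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // In Production, run only tests whose titles carry the @smoke tag; elsewhere, run everything.
  grep: process.env.TEST_ENV === 'production' ? /@smoke/ : /.*/,
});
```

A test would opt in by carrying the tag in its title, for example `test('dashboard loads @smoke', ...)`.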
Since we have established that application performance and functionality will differ from environment to environment, let’s look into how specific conditions of those environments may lead to false positives. This is not an exhaustive list but rather an illustration of the most common environmental characteristics that contribute to differences between test runs across environments. Your own environments may exhibit these characteristics to a greater or lesser extent.
Indisputably, the hardest part of end-to-end testing is getting test data into place and cleaning it up properly afterward. This needs to be done separately for each environment, since the alternative would be either to have all environments share a database or to keep every environment’s database in sync. Both of those options are untenable and almost certainly inadvisable.
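As an illustration of what per-environment setup and cleanup can look like, here is a hedged sketch of a Playwright fixture; the fixture endpoints, API hosts, and the orderId name are hypothetical stand-ins for whatever your application provides.

```ts
// order-fixture.ts: a hypothetical per-environment data setup and teardown fixture.
import { test as base } from '@playwright/test';

// Each environment gets its own API host; no shared or synced databases are assumed.
const API_HOSTS: Record<string, string> = {
  testing: 'https://api.testing.example.com',
  staging: 'https://api.staging.example.com',
};

export const test = base.extend<{ orderId: string }>({
  orderId: async ({ request }, use) => {
    const host = API_HOSTS[process.env.TEST_ENV ?? 'testing'];
    // Create the data this test needs in this environment only.
    const created = await request.post(`${host}/test-fixtures/orders`, { data: { sku: 'demo-sku' } });
    const { id } = await created.json();
    await use(id);
    // Clean up in the same environment so later runs start from a known state.
    await request.delete(`${host}/test-fixtures/orders/${id}`);
  },
});
```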
Depending on what automated gates your teams have in their deployment processes, some or all of the environments they use may be running different versions of the application at any given time. The UI itself, for example, will often differ in some ways, since teams frequently modify, add, and remove features as builds are promoted from one environment to the next.
Sometimes a feature headed to Production is intended to run in a limited way, perhaps once a month or at night when traffic is low, but the automated test may need to force it to run several times a day. Depending on the environment, the test may also need to run against an intentionally smaller set of data in one environment than in another.
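Here is a rough sketch of what forcing that kind of run could look like; the /admin/jobs endpoints, the job name, and the per-environment record counts are all hypothetical.

```ts
// A hypothetical test that triggers a scheduled job on demand with an environment-sized data set.
// Assumes baseURL is set in playwright.config.ts.
import { test, expect } from '@playwright/test';

const RECORD_COUNTS: Record<string, number> = { testing: 10, staging: 1_000 };

test('billing job processes the seeded records', async ({ request }) => {
  const records = RECORD_COUNTS[process.env.TEST_ENV ?? 'testing'];
  // Trigger the job immediately instead of waiting for its monthly or overnight schedule.
  await request.post('/admin/jobs/run', { data: { job: 'nightly-billing', records } });
  // Poll until the job reports completion.
  await expect
    .poll(async () => {
      const status = await request.get('/admin/jobs/nightly-billing/status');
      return (await status.json()).state;
    })
    .toBe('completed');
});
```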
Higher environments will frequently have greater memory allocation and faster processors, while lower environments are often intentionally underprovisioned, and those provisioning policies are not always well communicated or broadly known. As a result, an automated test that exercises a component running well in a higher environment might fail in a lower environment because the component times out.
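One way to absorb those provisioning differences is to scale timeouts by environment in the test configuration. This sketch assumes a TEST_ENV variable, and the specific values are placeholders to tune against your own hardware.

```ts
// playwright.config.ts: a sketch of per-environment test timeouts.
import { defineConfig } from '@playwright/test';

const TIMEOUTS_MS: Record<string, number> = {
  development: 120_000, // intentionally underprovisioned, so allow slower responses
  staging: 60_000,
  production: 30_000,
};

export default defineConfig({
  timeout: TIMEOUTS_MS[process.env.TEST_ENV ?? 'development'],
});
```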
Underprovisioning, however, can lead to unintended consequences, and each environment also has its own usage patterns that cause differences between them. One very common problem is a full disk in a given environment, which causes the application using that disk to throw errors in certain situations; any tests covering the feature that throws those errors will fail in the environment where the disk is full. Or perhaps your tests all share a common database user, and a surge in testing activity causes the environment to run out of available connection-pool threads.
Depending on how strict the rules are for graduating applications from one environment to the next, tests that pass in one environment may fail in another because the changes being promoted haven’t been fully debugged. The less teams adhere to strict and consistent rules for promotion, the less reliable a test will be from one environment to the next.
Especially when tests rely on third-party services or appliances, restrictions on the usage of those services or appliances can impact testing. Throttling on services, or limits on the number of appliances or licenses, may mean that certain tests fail or, preferably, are disabled in a given environment. For the full suite to pass, the tests that call those third-party services have to be limited to run only when availability is assured, or turned off completely in the constrained environment.
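In practice, that usually means skipping those tests conditionally rather than letting them fail. The sketch below uses Playwright’s test.skip with a hypothetical THIRD_PARTY_ENABLED flag set per environment.

```ts
// A sketch of skipping, rather than failing, tests that depend on a throttled or unlicensed service.
import { test } from '@playwright/test';

const thirdPartyEnabled = process.env.THIRD_PARTY_ENABLED === 'true';

test('creates a shipment through the carrier integration', async ({ page }) => {
  // Skip when this environment has no quota or license for the third-party service.
  test.skip(!thirdPartyEnabled, 'Carrier service is unavailable in this environment');
  await page.goto('/shipments/new');
  // ...the rest of the flow runs only where the service is available.
});
```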
Various configurations might cause a test to fail in a given environment. This can be as simple as a feature flag or A/B flag left on, a network routing change for a given IP range, or any number of custom, application-specific configuration problems. The simple fact is that configurations usually vary from environment to environment.
Automobile manufacturers test new models by putting a car through its paces on various terrain and under various conditions. A new model that handles well on dry pavement may skid under wet conditions. In the same way, when an automated test runs in different environments, the cause for any failure may not be the same because the conditions are not the same.
Now, don’t be mistaken: We’re certainly not arguing to get rid of your environments so that you don’t have to create separate end-to-end tests on each. Far from it! Environmental differences are critical for software development.
Hopefully by now we’ve demonstrated that you can’t simply drop an end-to-end test written for one environment into another environment and use that test to decide whether a build is production-ready.
It would be nice if end-to-end tests could be written at a layer of abstraction that prevented subtle, unimportant differences from causing them to fail. But that level of consistency would be difficult (maybe impossible) to maintain across multiple versions of multiple applications simultaneously. What’s more, there are those who would regard this practice as a “code smell.”
The fact is that developers and testers sometimes set up additional environments (such as Demo or ephemeral Preview environments) precisely because they want either to avoid disruptions caused by others or to create invasive disruptions of their own, such as those caused by load or spike testing.
We recommend running end-to-end tests against a single environment. The Production environment (check out DoorDash's approach) will give you the best results for such activity, but if your organization isn’t there yet, we recommend the highest pre-production environment.
That said, if our customers want us to run their tests on multiple environments, we at QA Wolf have found that the best way to manage the differences between environments is to write and maintain unique tests for each environment under test. That way we can control not only which tests run in a particular environment but also how they behave there.
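One concrete way to express “unique tests for each environment” in a single repository is to give each environment its own project and test directory. The sketch below does this with Playwright projects; the directory names, URLs, and tag are hypothetical.

```ts
// playwright.config.ts: a sketch that keeps a separate test suite per environment in one repo.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'staging',
      testDir: './tests/staging',
      use: { baseURL: 'https://staging.example.com' },
    },
    {
      name: 'production',
      testDir: './tests/production',
      grep: /@smoke/, // Production runs only tests tagged as safe for live systems.
      use: { baseURL: 'https://www.example.com' },
    },
  ],
});
```

Each environment can then be run on its own schedule with, for example, `npx playwright test --project=staging`.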
We have learned that we can’t simply run the same test on different environments and trust that the results will be the same; nor can we assume that fixing a flaky test for one environment makes it ready to run successfully in every environment in your pipeline.
Finally, in terms of cost: we have to triage failures in proportion to how often the tests run, and tests run more often when you add more environments to test against, so we need to charge more.