Mention testing in production and you might recall the days when developers snuck releases past the QA team in hopes of keeping the application up to date. In reality, that only created a buggy mess, and users were the ones who suffered. For this reason, most businesses avoid testing in production altogether because it’s too risky for the end user.
But there are problems with not testing in production, too. Test environments are rarely built out to the same level as production environments, so they can never really achieve the scale that you’d see in “real life.” Plus, testing environments can easily get stale and out-of-date — and as a result, you aren’t testing what you ought to be.
Sandy Mappic, Sr. Support Engineer at AppDynamics, frames it perfectly: “When a Formula 1 team designs a car in a wind tunnel and tests it on a simulator pre-season, they don’t assume that the performance they see in test will mirror the results they see in the race. Yet, that is pretty much what happens today in application development lifecycle.”
Testing in production is an important core competency for any world-class test team to cultivate. In this post we offer some practical tips to make testing in production an achievable reality — and to mitigate the obvious risks that it exposes.
Testing in production gone wrong
Sometimes learning a best practice begins by understanding a worst practice. An excellent example of how testing in production can go wrong was reported by the IBM WebSphere team a few years ago. One of their customers, a major bank, had a development team that set up a test server within the bank’s production environment. In fact, the test application was running on the same installation of WebSphere Application Server as the production application.
Ports were configured to avoid conflict, but by and large these two application servers were far more intertwined than they ought to have been. They shared log files, hardware resources, and their binary codebase. This meant that any upgrade to the software development kit (SDK) would disrupt both application servers. Frequent updates to the test system were necessary, but the repeated disruption to the production system was intolerable because it directly affected users.
IBM’s recommendation to this bank was to build an identical copy of their production center. That is sound advice, but you are definitely going to need a large pile of money to pull that one off.
Tips for testing in production the right way
So here are some things you can do to develop robust procedures for testing in your production environment without having a severe impact on your users.
1. Make layers – like a stack of pancakes
The idea of “testing in production” can actually mean different things. Are you testing a bunch of test servers from within your production data center? Or are your test applications running separately on top of your production platform? Or are you truly running live tests against 100% production-deployed code? The answer should be all of these. Layer your production testing to give you the ability to test different aspects of the production environment in different ways. Then match up your test cases so as to minimize the impact that your testing — and maintenance of the test environment — has on production users.
2. Time your tests when usage is light
Performance testing can have an impact on your entire user base if you let it. It can make a server environment sluggish, and that’s something no one wants. Study your analytics and determine the best time to schedule your tests. For example, look for the lowest levels of:
- Number of users on the site
- Revenue generation across the site
- Resource-intensive processes within the environment
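To make the idea concrete, here is a minimal sketch of how you might rank hours of the day by combined load. It assumes you can export hourly figures from your analytics tool; the `HourlyStats` fields and the equal weighting of the three signals are illustrative assumptions, not a prescribed formula.

```python
from dataclasses import dataclass


@dataclass
class HourlyStats:
    # Hypothetical shape of one row exported from your analytics tool.
    hour: int           # hour of day, 0-23
    active_users: int   # number of users on the site
    revenue: float      # revenue generated during that hour
    heavy_jobs: int     # resource-intensive processes running


def quietest_hours(stats, top_n=3):
    """Return the top_n hours with the lowest combined, normalized load."""
    if not stats:
        return []

    def normalize(values):
        # Scale each signal to 0..1 so no single unit dominates the score.
        peak = max(values)
        return [v / peak if peak else 0.0 for v in values]

    users = normalize([s.active_users for s in stats])
    revenue = normalize([s.revenue for s in stats])
    jobs = normalize([s.heavy_jobs for s in stats])

    # Equal weighting is an assumption; tune it to what matters to you.
    scored = sorted(
        (u + r + j, s.hour) for s, u, r, j in zip(stats, users, revenue, jobs)
    )
    return [hour for _, hour in scored[:top_n]]
```

You would likely want to run this over several weeks of data, since a single day can be misleading.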
3. Collect real traffic data and replay it to your test systems
Use the actual traffic data you have been collecting in production (such as user workflows, user behavior, and resources) to drive load generation for your test cases. That way, when you run load tests within your production environment, you’ll have confidence that the simulated behavior is realistic.
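As a rough illustration, the sketch below replays captured requests in their original order while preserving the gaps between them. The `RecordedRequest` record and the `send` callable are assumptions made for this example; in practice `send` would wrap whatever HTTP client or load tool you already use.

```python
import time
from dataclasses import dataclass


@dataclass
class RecordedRequest:
    # Hypothetical capture format: adapt to whatever your logs contain.
    offset_s: float  # seconds since the start of the capture
    method: str
    path: str


def replay(requests, send, speedup=1.0, sleep=time.sleep):
    """Replay requests in captured order, keeping the original pacing.

    `send` is a caller-supplied function that issues one request against
    the test system. `speedup` above 1.0 compresses the gaps to add load.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")

    results = []
    previous_offset = None
    for req in sorted(requests, key=lambda r: r.offset_s):
        if previous_offset is not None:
            sleep((req.offset_s - previous_offset) / speedup)
        previous_offset = req.offset_s
        results.append(send(req))
    return results
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. Remember to scrub personal data from captured traffic before replaying it.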
4. Introduce a chaos monkey
According to Netflix engineers Cory Bennett and Ariel Tseitlin, “The best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.” Netflix built what’s called a Chaos Monkey into their production environment. This code randomly introduces failures into the production environment, forcing engineers to design recovery systems and develop a stronger, more adaptive platform. You can put your own chaos monkey in place because Netflix released their code on GitHub.
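To show the core idea only, here is a toy sketch of random failure injection. It is not Netflix’s implementation: the `terminate` callable, the `protected` set, and the default probability are all assumptions for illustration, and you would plug in your own infrastructure API.

```python
import random


def unleash_monkey(instances, terminate, probability=0.2,
                   protected=frozenset(), rng=None):
    """Maybe terminate one randomly chosen instance.

    `terminate` is a caller-supplied function (hypothetical here) that
    actually kills an instance. Returns the victim, or None if the
    monkey stayed quiet on this run.
    """
    if rng is None:
        rng = random.Random()

    # Never touch instances that were explicitly marked as off limits.
    candidates = [i for i in instances if i not in protected]
    if not candidates or rng.random() >= probability:
        return None

    victim = rng.choice(candidates)
    terminate(victim)
    return victim
```

Running something like this on a schedule during business hours, when engineers are around to respond, is a sensible starting point.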
5. Monitor like crazy
When you are running a production test, keep your eye on key user performance metrics so that you know if the test is having any kind of unacceptable impact on the user experience. Be prepared to shut the test down if that’s the case.
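One way to be “prepared to shut the test down” is to automate the check. The sketch below runs a test step by step and aborts as soon as 95th-percentile latency crosses a limit. The `read_latencies_ms` callable stands in for your monitoring system, and the choice of p95 as the metric is an assumption; use whichever user-facing metric you trust.

```python
def p95(samples):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def run_with_guardrail(test_steps, read_latencies_ms, max_p95_ms):
    """Run steps in order, stopping early if users are being hurt.

    `read_latencies_ms` is a caller-supplied function (hypothetical)
    returning recent production latencies in milliseconds.
    Returns (steps_completed, aborted).
    """
    completed = 0
    for step in test_steps:
        step()
        completed += 1
        latencies = read_latencies_ms()
        if latencies and p95(latencies) > max_p95_ms:
            return completed, True
    return completed, False
```

The check runs after every step, so keeping steps short limits how long users can be affected before the test stops.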
6. Create an “opt-in” experience for experimental testing
A great way to test how your application performs with real users is to have some of them opt in to new feature releases. This will allow you to monitor and collect data from real users and adjust your testing strategy accordingly, without as much concern about impacting their experience. After all, they’ve already agreed to become test subjects, so a little hiccup here and there won’t come as a surprise.
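In code, an opt-in experience often comes down to a simple feature gate. The following is a minimal sketch, assuming you store opted-in user IDs per feature; the feature name `new_checkout` and both function names are made up for the example.

```python
def is_feature_enabled(feature, user_id, opted_in):
    """True only if this user explicitly opted in to the feature.

    `opted_in` maps a feature name to the set of user IDs who agreed
    to try it (a hypothetical in-memory stand-in for your datastore).
    """
    return user_id in opted_in.get(feature, set())


def render_checkout(user_id, opted_in):
    # Everyone who has not opted in keeps the stable experience.
    if is_feature_enabled("new_checkout", user_id, opted_in):
        return "new checkout flow"
    return "stable checkout flow"
```

Defaulting to the stable path means a missing or misconfigured flag never exposes an experiment to users who did not ask for it.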
Testing in production is a good thing
Make sure to encourage your team to integrate testing in production into your strategy. It’s a great way to get exposure to real-world scenarios and to find the bugs you normally wouldn’t come across in a testing environment. Remember, just because you have tested the application’s performance in the lab does not mean you will see the exact same performance in production.
This post was originally published in 2014 and was most recently updated in July 2021.