Resilience testing: a complete guide for testers

In recent times, users of both in-house and commercial systems have become less tolerant of systems that are unstable or difficult to use. This sea change in user expectations has been fueled by the increasing availability of online systems, particularly desktop and mobile apps for e-commerce and social media. If users don’t like one app, there is always another.

In business, organizations have been under great pressure since the COVID-19 pandemic to use online systems to improve internal operational efficiencies and deliver new and innovative applications to both suppliers and customers. E-commerce is a great example.

The increase in the availability of systems and apps, and the rate at which they are hitting the market, has meant two things. First, systems must now go from concept to delivery in a much shorter time while remaining focused, stable, and usable.

The second implication is that testing and quality assurance are vital to ensure systems are as usable and stable as possible from the outset. This need has been exacerbated by the requirement for systems to operate on multiple platforms and under different operating systems. Smart devices, laptops, and desktops running Android, Windows, and Apple operating systems are common environments.

Fail to meet user expectations, and they will abandon your websites and reject your systems.

While there are many parts to a full test, resilience testing is a critical aspect of software quality assurance, ensuring that systems remain functional under stress, recover quickly from failures, and maintain service availability even in the face of disruptions. It is sometimes called chaos testing.

Here’s a complete guide for testers on how to approach resilience testing effectively.

What is resilience testing, and why is it vital?

A good working definition of resilience testing is that it evaluates a system’s ability to handle and recover from failures, such as hardware malfunctions, software bugs, network issues, and unexpected spikes in load.

Nothing upsets a user more than an app that is unpredictable in performance or fails and prevents them from carrying on. Resilience testing is needed to ensure that doesn’t happen.

When designing a test environment, you should also be aware that a multi-platform application will need several test scenarios and sets of metrics. As noted above, many applications now need to be available for smartphone, tablet, and desktop environments on a variety of operating platforms, especially Android, Apple iOS and macOS, and Microsoft Windows.

Tricentis has put together a useful primer on testing in general and resilience testing.

Resilience testing vs. performance testing

It’s important to differentiate between performance and resilience testing. Unlike performance testing, which focuses on system speed and capacity, resilience testing assesses how robust the system is in maintaining service levels and quickly recovering from disruptions.

Let’s look at each in a little more detail:

Performance testing

Simply put, the objective of performance testing is to ensure that a system meets specific performance criteria under various conditions.

This means testing will fall into several categories:

  1. Load testing determines how a system performs under expected user loads. It helps in identifying the maximum operating capacity of an application.
  2. Stress testing tests the system under extreme conditions to see how it behaves under stress. This includes pushing the system beyond its normal operational limits to understand its breaking point.
  3. Scalability testing evaluates how well the system scales with increasing load, either by adding resources or by optimizing existing ones.
  4. Endurance testing tests the system’s performance over an extended period to ensure it can handle long-term operations without degradation.
  5. Volume testing assesses the system’s performance with a large volume of data.

Resilience testing

On the other hand, resilience testing has the objective of ensuring that a system can recover from failures and continue to operate effectively.

Again, there are several test types to be considered:

  1. Fault tolerance testing tests how well the system can handle hardware or software failures without affecting overall functionality.
  2. Recovery testing assesses the system’s ability to recover from crashes, data corruption, or other disruptions.
  3. Chaos engineering introduces random failures or faults into the system to observe how it reacts and recovers, aiming to build systems that are robust and can handle unexpected issues gracefully.
  4. Disaster recovery testing validates the effectiveness of backup and recovery procedures to ensure data integrity and system availability in the event of a major failure.
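
As a minimal sketch of fault tolerance and recovery testing, the Python below injects a fixed number of faults into a stand-in dependency and checks that a simple retry policy absorbs them. `FlakyService` and `call_with_retry` are hypothetical names invented for this illustration; they are not part of any tool mentioned in this guide, and a real test would target your own service calls.

```python
class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(operation, max_attempts):
    """Call operation until it succeeds or max_attempts is used up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # the fault outlasted the retry budget


# Fault tolerance: two injected faults are absorbed by three attempts.
service = FlakyService(failures_before_success=2)
result = call_with_retry(service.fetch, max_attempts=3)
print(result, service.calls)  # prints "ok 3"
```

If the service is configured to fail more often than the retry budget allows, the `ConnectionError` propagates, which is the signal a recovery test would record as a failed recovery.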

In summary, performance testing is focused on ensuring that a system operates efficiently and effectively under normal and peak conditions, while resilience testing is centered on ensuring that a system can handle and recover from unexpected failures and disruptions.

Both are essential for creating a reliable and high-performing system, but they address different aspects of system robustness and operational effectiveness.

Key objectives of resilience testing

  1. Ensure continuity: Test the system’s ability to continue operations even when parts of it fail.
  2. Recovery capability: Assess how quickly and effectively the system can recover from a failure.
  3. Fault tolerance: Measure the system’s ability to handle unexpected faults without crashing.
  4. Graceful degradation: Evaluate how the system degrades when under stress, ensuring that it fails in a controlled and predictable manner.
  5. Identify weak points: Discover vulnerabilities or weak points that could lead to failures.
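
Graceful degradation can be illustrated with a small sketch: when a backend becomes unavailable, the component serves the last known value and flags the response as degraded instead of crashing. `ProductCatalog`, `ToggleBackend`, and the price value are assumptions made up for this example, not a prescribed design.

```python
class ToggleBackend:
    """Hypothetical backend that a test can switch off to simulate an outage."""

    def __init__(self):
        self.available = True

    def __call__(self, product_id):
        if not self.available:
            raise ConnectionError("backend unavailable")
        return 9.99  # illustrative fixed price


class ProductCatalog:
    """Serves cached data, marked as degraded, while the backend is down."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def get_price(self, product_id):
        try:
            price = self.backend(product_id)
            self.cache[product_id] = price
            return {"price": price, "degraded": False}
        except ConnectionError:
            # Controlled failure: fall back to the last known value, if any.
            return {"price": self.cache.get(product_id), "degraded": True}


backend = ToggleBackend()
catalog = ProductCatalog(backend)
healthy = catalog.get_price("sku-1")      # normal operation fills the cache
backend.available = False                 # induce the failure
degraded = catalog.get_price("sku-1")     # stale but usable answer
uncached = catalog.get_price("sku-2")     # nothing cached: price is None
```

A resilience test would assert on the `degraded` flag and on the absence of an unhandled exception, which together show that the component fails in a controlled and predictable manner.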

Steps in conducting resilience testing

1. Define the scope

  • Identify critical components: Focus on the components critical to business operations.
  • Set clear objectives: Understand what you want to achieve with the tests (e.g., minimizing downtime, improving recovery time).

2. Plan the testing scenarios

  • Failure scenarios: Simulate different types of failures, such as server crashes, network outages, and hardware failures.
  • Load conditions: Test under various load conditions to see how the system behaves under stress.
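
One way to plan scenarios is to cross every failure type with every load condition so that no combination is overlooked. The sketch below does this with `itertools.product`. The failure names come from the list above, while the load level names and the `build_scenarios` helper are assumptions for illustration.

```python
from itertools import product

FAILURES = ["server_crash", "network_outage", "hardware_failure"]
LOAD_LEVELS = ["normal", "peak", "overload"]  # assumed labels, adjust to your system


def build_scenarios(failures, load_levels):
    """Return one scenario per (failure, load) combination."""
    return [
        {"id": f"{failure}@{load}", "failure": failure, "load": load}
        for failure, load in product(failures, load_levels)
    ]


scenarios = build_scenarios(FAILURES, LOAD_LEVELS)
for scenario in scenarios:
    print(scenario["id"])
```

Three failure types and three load levels give nine scenarios. In practice you might prune combinations that are not relevant to the critical components identified in step 1.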

3. Design test cases

  • Automated failures: Use scripts or tools to automate the introduction of failures (e.g., Chaos Monkey).
  • Manual interventions: Manually trigger failures to observe system behavior and recovery.
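
To show what automating the introduction of failures can look like at the code level, here is a small sketch of a fault injector that wraps a function and raises an error at a configured rate. It does not use Chaos Monkey or its API; `FaultInjector` and `handle_request` are hypothetical names, and the seeded random generator is an assumption made so that a test run can be repeated.

```python
import random


class FaultInjector:
    """Wraps a callable and raises an injected error at a configured rate."""

    def __init__(self, target, failure_rate, seed=0):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.target = target
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.injected = 0

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise RuntimeError("injected failure")
        return self.target(*args, **kwargs)


def handle_request(n):
    """Stand-in for the operation under test."""
    return n * 2


chaotic = FaultInjector(handle_request, failure_rate=0.3, seed=42)
outcomes = []
for n in range(20):
    try:
        chaotic(n)
        outcomes.append("ok")
    except RuntimeError:
        outcomes.append("failed")

print(f"{chaotic.injected} of {len(outcomes)} calls had a failure injected")
```

Keeping the seed in the test case means a failure sequence that exposed a weakness can be replayed after a fix.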

4. Execute the tests

  • Monitor system behavior: Closely observe the system’s response to the induced failures.
  • Log failures and recovery: Record any failures and how quickly the system recovers.
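
Recording how quickly the system recovers can be sketched as a polling loop that measures the time until a health check passes again. `measure_recovery` and its parameters are hypothetical, and the `FakeClock` class exists only so that the example runs instantly. A real run would keep the default clock and pass a health check that queries your system.

```python
import time


def measure_recovery(is_healthy, timeout_s, poll_interval_s=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() and return seconds until it is True, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)


class FakeClock:
    """Simulated clock so the example does not wait in real time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Simulate a service that becomes healthy 2 seconds after the induced failure.
fake = FakeClock()
recovery_s = measure_recovery(
    is_healthy=lambda: fake.now >= 2.0,
    timeout_s=10.0,
    clock=fake.time,
    sleep=fake.sleep,
)
print(f"recovered after {recovery_s:.1f}s")  # prints "recovered after 2.0s"
```

The measured value is only as precise as the polling interval, so choose an interval that is small relative to the recovery times you expect.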

5. Analyze results

  • Measure downtime: Calculate the time taken to recover from failures.
  • Identify bottlenecks: Find the root causes of slow recovery or unexpected failures.
  • Evaluate fault tolerance: Determine how well the system can handle multiple concurrent failures.
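
The downtime calculation can be sketched from a log of failure and recovery timestamps. The function below, a hypothetical `summarize_outages`, derives total downtime, mean time to recover, the longest outage, and availability over the test window. The timestamps are invented for illustration, and the sketch assumes outages do not overlap.

```python
from datetime import datetime, timedelta


def summarize_outages(outages, window_start, window_end):
    """Summarize (failed_at, recovered_at) pairs observed in a test window."""
    durations = [recovered - failed for failed, recovered in outages]
    total_downtime = sum(durations, timedelta())
    window = window_end - window_start
    mean_recovery = total_downtime / len(durations) if durations else timedelta()
    return {
        "outage_count": len(durations),
        "total_downtime": total_downtime,
        "mean_time_to_recover": mean_recovery,
        "longest_outage": max(durations, default=timedelta()),
        "availability": 1 - total_downtime / window,
    }


# Illustrative one-hour test window with a 2-minute and a 4-minute outage.
start = datetime(2024, 1, 1, 12, 0)
outages = [
    (start + timedelta(minutes=10), start + timedelta(minutes=12)),
    (start + timedelta(minutes=40), start + timedelta(minutes=44)),
]
summary = summarize_outages(outages, start, start + timedelta(hours=1))
print(summary["mean_time_to_recover"], f"{summary['availability']:.1%}")
```

Here the two outages total 6 minutes in a 60-minute window, giving a mean time to recover of 3 minutes and 90% availability.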

6. Report findings

  • Document weaknesses: Clearly outline any vulnerabilities or weaknesses discovered during testing.
  • Provide recommendations: Suggest improvements to enhance system resilience.

7. Implement improvements

  • Fix identified issues: Work with development teams to resolve the issues found during testing.
  • Retest after changes: Perform another round of resilience testing after changes have been made to ensure that the system’s resilience has improved.
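
To check that a retest shows an improvement, the metrics from the earlier run can be compared with the new ones. The sketch below assumes every metric is one where lower is better. The `compare_runs` helper, the metric names, and the numbers are all hypothetical.

```python
def compare_runs(baseline, retest, tolerance=0.0):
    """Label each lower-is-better metric as improved, regressed, or unchanged."""
    report = {}
    for name, before in baseline.items():
        after = retest[name]
        if after < before - tolerance:
            report[name] = "improved"
        elif after > before + tolerance:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report


# Illustrative figures only; real values come from your own test runs.
baseline = {"recovery_seconds": 120.0, "failed_requests": 35, "downtime_seconds": 300.0}
retest = {"recovery_seconds": 45.0, "failed_requests": 35, "downtime_seconds": 320.0}

report = compare_runs(baseline, retest)
for metric, verdict in report.items():
    print(f"{metric}: {verdict}")
```

Any metric reported as regressed is a reason to go back to step 5 before closing the issue.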

Resilience testing tools are available both for proprietary environments and as open-source applications.

Resilience testing tools

Many testing tools are available, some free and some commercial. They may support multiple platforms and operating environments or be specific to a particular browser environment. A tool can focus on a single test area (for example, image rendering or memory usage) or cover a range of test metrics, or all of them.

NeoLoad

One market leader covering most test scenarios, including resilience testing, is NeoLoad.

Applications are all built differently for different environments and application areas, but they all need to perform. NeoLoad is a suite of tools that simplifies and scales performance testing for everything from APIs and microservices to end-to-end application testing through innovative protocol and browser-based capabilities.

Conclusion

Resilience testing is essential for ensuring that a system can handle unexpected challenges without catastrophic failure. By systematically planning, executing, and refining resilience tests, testers can help build more robust and reliable systems that are better equipped to handle the unpredictable.

This guide should provide a comprehensive understanding of resilience testing, empowering you to implement effective testing strategies in your projects.

This post was written by Iain Robertson. Iain has operated as a freelance IT specialist through his own company since leaving formal employment in 1997. He provides onsite and remote global interim, contract, and temporary support as a senior executive in general and ICT management. He usually operates as an ICT project manager or ICT leader in the tertiary education sector. He has recently semi-retired as an ICT director and part-time ICT lecturer at an Ethiopian university.

Author:

Guest Contributors

Date: Nov. 26, 2024