Performance testing

Software Performance Failures due to Corporate Controversy

Many systems can absorb unexpected performance demand and keep operating. Read about two failures that resulted from a single corporate controversy.

Jan. 31, 2023

Author: Bryan Cole

It’s remarkable how many systems can manage unexpected performance demand, scaling resources and continuing operations. Yet truly unexpected demand can still apply enough stress to a system that it crashes. Here, we’re going to examine two failures that resulted from a single corporate controversy.

What happened?

Wizards of the Coast, a subsidiary of Hasbro, owns the Dungeons & Dragons property. In early January 2023, news broke that the company planned to update its Open Game License (OGL). While the old license offered a perpetual, royalty-free license to create content using the rules from Wizards of the Coast, the new version reportedly came with massive limitations and revoked the old perpetual license.

At the time of this writing, Wizards of the Coast has yet to issue the new license, but the damage has been done. The changes were not well received by the community, and several influencers suggested that people cancel their online subscriptions to dndbeyond.com. Shortly afterward, the company's biggest competitor, Paizo, released a rebuttal to the rumored changes.

Both companies experienced an outage on their websites as a result.

On dndbeyond.com, the subscription termination service failed under load, leaving many users unsure of the status of their subscription. On the Paizo site, the blog platform simply collapsed, showing only a 502 Bad Gateway error message. This series of events is a useful case study in truly unprecedented volumes of user traffic.

Performance engineering is the discipline of ensuring that highly scalable applications continue to function. In this case, both companies almost certainly conducted performance testing on these platforms prior to these events. However, performance engineering is much more than performance testing. It should also cover these types of scenarios, as well as how to mitigate them in production.

Remediation Strategies

Large-scale performance tests that probe the boundary and failure conditions of an application are often overlooked. The goal is to push the system to loads so extreme that it fails entirely, while monitoring the application to observe the symptoms it exhibits before the failure. Monitoring is vital here: organizations rarely have the budget to replicate the full production environment for application delivery teams, so the symptoms recorded during the test are what allow the same conditions to be recognized later in production.
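As a rough illustration of the idea, the sketch below ramps load in steps until a system fails and records the latency seen at each step, so the early symptom (here, latency doubling relative to the first step) can be located well before the breaking point. Everything in it is an assumption made for illustration: `simulated_system` is a hypothetical stand-in with a made-up capacity of 1,000 users and a simple queueing-style latency curve, and a real test would replace it with a probe driven by an actual load-testing tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    load: int                    # concurrent users applied in this step
    latency_ms: Optional[float]  # None means the system failed at this load


def simulated_system(load: int, capacity: int = 1000,
                     base_ms: float = 50.0) -> Optional[float]:
    """Hypothetical system under test (not a real service).

    Latency rises sharply as load approaches capacity; at or beyond
    capacity the system is treated as having failed.
    """
    if load >= capacity:
        return None
    return base_ms / (1.0 - load / capacity)


def ramp_to_failure(probe: Callable[[int], Optional[float]],
                    start: int, step: int, max_load: int) -> List[Sample]:
    """Increase load step by step, stopping at the first failure."""
    samples: List[Sample] = []
    for load in range(start, max_load + 1, step):
        latency = probe(load)
        samples.append(Sample(load, latency))
        if latency is None:
            break  # breaking point reached; no need to push further
    return samples


def first_symptom(samples: List[Sample],
                  slowdown_factor: float = 2.0) -> Optional[int]:
    """Return the first load where latency is slowdown_factor x the baseline."""
    baseline = samples[0].latency_ms
    if baseline is None:
        return None
    for s in samples:
        if s.latency_ms is not None and s.latency_ms >= slowdown_factor * baseline:
            return s.load
    return None


if __name__ == "__main__":
    run = ramp_to_failure(simulated_system, start=100, step=100, max_load=2000)
    print("breaking point:", run[-1].load)
    print("first symptom at:", first_symptom(run))
```

The gap between the load where the symptom first appears and the load where the system breaks is the useful result: it suggests where a protective limit could be placed.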

Once you have observed an application's failure symptoms, you can implement the second part of the strategy: limiting the load that can reach the system under extreme traffic. Excess user load is redirected to a static page that contains some variation of the text “Due to unexpectedly high volume, the site is unavailable at this time. Please try again later.” This is vastly preferable to a 502 error page.
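One way to picture this kind of load limiting is a small admission gate in front of the application: requests beyond a fixed number in flight get the static message instead of reaching the backend. The sketch below is a minimal illustration under assumptions, not how either company's platform works. `LoadShedder`, `handle_request`, and the limit are hypothetical names, and returning HTTP 503 with a `Retry-After` header is one common way to serve such a page.

```python
import threading
from typing import Callable, Dict, Tuple

OVERLOAD_PAGE = ("Due to unexpectedly high volume, the site is unavailable "
                 "at this time. Please try again later.")


class LoadShedder:
    """Admits at most max_in_flight concurrent requests (hypothetical helper)."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot if one is free; never blocks."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give the slot back once the request has finished."""
        with self._lock:
            self._in_flight -= 1


def handle_request(shedder: LoadShedder,
                   app: Callable[[], str]) -> Tuple[int, Dict[str, str], str]:
    """Serve the request, or the static overload page when the gate is full."""
    if not shedder.try_acquire():
        # Assumed response shape: 503 plus a hint on when to retry.
        return 503, {"Retry-After": "60"}, OVERLOAD_PAGE
    try:
        return 200, {}, app()
    finally:
        shedder.release()


if __name__ == "__main__":
    gate = LoadShedder(max_in_flight=1)
    gate.try_acquire()                          # simulate one request in progress
    print(handle_request(gate, lambda: "ok"))   # shed: static page
    gate.release()
    print(handle_request(gate, lambda: "ok"))   # admitted
```

The limit itself would come from the symptoms observed during failure testing, set at a level the system handles comfortably.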

Why don’t companies do this?

There are several reasons, but two stand out.

First and foremost, these sorts of extreme load events are seen as extraordinarily unlikely to ever occur. Testing for scenarios that are less likely than being struck by lightning while simultaneously winning the lottery doesn’t make financial sense.

Secondly, these sorts of tests carry additional costs. Limits on usage as part of software contracts, as well as the cost of running infrastructure at extremely high usage rates, can make testing costly – but not as costly as application failure.

The unlikely nature of extreme load events for many applications, combined with the high cost of running these sorts of tests, means that many organizations do not stress systems to the breaking point to identify failure states, or put guardrails in place to remediate these conditions and ensure application survival.

What can we do?

Application delivery and performance engineering are complex activities, and there is no simple solution. Ideally, you can deduce operational parameters from normal testing and establish the same guardrails. In many cases I would advocate for putting such protective barriers in place well before the system is in danger. As an analogy, consider a fence that keeps us away from the edge of a cliff. Ideally, it’s placed well back from the edge. Even though there is potentially a lot of ground to safely walk on, the fence is there to keep us safe.

It’s important to note that putting guardrails in place can lead to higher operating expenses. Limiting your servers to 80% CPU usage, for example, means that you might require 12 servers instead of 10. At this point it becomes a business decision: weigh the value of application reliability, where users can hit the refresh button and still access the application, against the cost of letting it fail. In this case, I do not believe that the Paizo team views their blog site as business critical. Once service is restored, they’re unlikely to experience such a surge of user traffic again.
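The server arithmetic above can be made explicit with a short calculation. This is a back-of-the-envelope sketch that assumes load spreads evenly across identical servers; the function name and the peak-utilization inputs are illustrative and not taken from any real deployment. Under that assumption, the 12-server figure corresponds to a 10-server fleet that peaks at about 96% CPU, while a fleet that peaks fully saturated would need 13.

```python
def servers_needed(current_servers: int, peak_util_pct: int, cap_pct: int) -> int:
    """Servers required to carry the same peak demand under a CPU cap.

    Percentages are whole numbers (e.g. 80 for 80%) so the ceiling
    division below is exact and free of floating-point rounding.
    """
    if current_servers < 0:
        raise ValueError("current_servers must not be negative")
    if not 0 <= peak_util_pct <= 100:
        raise ValueError("peak_util_pct must be between 0 and 100")
    if not 0 < cap_pct <= 100:
        raise ValueError("cap_pct must be between 1 and 100")
    demand = current_servers * peak_util_pct  # total demand in server-percent
    return -(-demand // cap_pct)              # ceiling division


if __name__ == "__main__":
    print(servers_needed(10, 96, 80))   # fleet peaking at 96% CPU
    print(servers_needed(10, 100, 80))  # fleet peaking fully saturated
```

The difference between the two results is the operating cost of the guardrail, which is the number the business decision turns on.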

Final Thoughts

This is a compelling story for performance engineering teams to relate to their management structures. No matter how much you test, without safety precautions an application can almost always be brought to the failure point. Closer communication with your colleagues in IT Operations, helping them understand application stress factors so they can limit traffic once those conditions begin to manifest, can lead to applications that continue to work for at least some of your user community.

Author:

Bryan Cole

Director of Customer Engineering
