It’s remarkable how many systems can absorb unexpected performance demand, scaling resources and continuing to operate. Yet truly unexpected demand can still stress a system to the point of failure. Here, we’re going to examine two failures that resulted from a single corporate controversy.
Wizards of the Coast, a subsidiary of Hasbro, owns the Dungeons & Dragons property. In early January 2023, the news broke that they planned to update their Open Gaming License. While the old license offered a perpetual, royalty-free right to create content using the rules from Wizards of the Coast, the new version imposed significant restrictions and revoked the old perpetual license.
At the time of this writing, Wizards of the Coast has yet to issue the new license, but the damage has been done. The changes were not well received by the community, and several influencers suggested that people cancel their online subscriptions to dndbeyond.com. Shortly afterward, Wizards of the Coast’s biggest competitor, Paizo, released a rebuttal to the rumored changes.
Both companies experienced an outage on their websites as a result.
On dndbeyond.com, the subscription termination service failed under load, leaving many users unsure of the status of their subscriptions. On the Paizo site, the blog platform simply collapsed, showing only a 502 Bad Gateway error. This series of events offers a striking case study in the nature of truly unprecedented volumes of user traffic.
Performance engineering is the discipline of ensuring that highly scalable applications continue to function. In this case, both companies had almost certainly conducted performance testing on these platforms prior to these events. However, performance engineering is much more than performance testing: it should also cover these types of scenarios, as well as how to mitigate them in production.
Running large-scale performance tests to probe the boundary and failure conditions of an application is an often-overlooked practice. The goal is to drive the system to such extreme loads that it fails entirely, while monitoring the application to observe the symptoms it exhibits before the failure. Monitoring is vital here, as organizations very rarely have the budget to replicate the full production environment for application delivery teams.
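The ramp-to-failure approach described above can be sketched in a few lines. This is a minimal, self-contained illustration: the "service" is a toy stand-in whose error probability rises past an assumed capacity, and every threshold here is an illustrative assumption rather than a real measurement.

```python
import random

def simulated_service(concurrent_users, capacity=500):
    """Toy backend: error probability climbs sharply past capacity."""
    overload = max(0.0, (concurrent_users - capacity) / capacity)
    return random.random() >= min(1.0, overload * 2)  # True = success

def ramp_to_failure(step=100, max_users=2000, requests_per_step=200,
                    failure_threshold=0.5, seed=42):
    """Increase simulated load step by step until the error rate crosses
    the failure threshold, recording the symptoms seen at each step."""
    random.seed(seed)
    history = []
    for users in range(step, max_users + 1, step):
        errors = sum(not simulated_service(users)
                     for _ in range(requests_per_step))
        error_rate = errors / requests_per_step
        history.append((users, error_rate))  # the monitoring record
        if error_rate >= failure_threshold:
            return users, history            # the observed breaking point
    return None, history

breaking_point, history = ramp_to_failure()
print(f"service broke down at ~{breaking_point} concurrent users")
```

The value of the exercise is the `history` list: the degradation pattern recorded on the way to failure is exactly the set of symptoms you would then watch for in production monitoring.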
Once you have observed an application’s failure symptoms, you can implement the second part of the strategy: limiting the load allowed to reach the system under extreme traffic conditions. Excess user load is redirected to a static page that contains some variation of the text “Due to unexpectedly high volume, the site is unavailable at this time. Please try again later.” This is vastly preferable to a 502 error page.
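One common way to implement this kind of load shedding is an admission gate in front of the backend. The sketch below, with an assumed in-flight-request cap, turns away overflow requests with the static message instead of letting the backend collapse into 502s:

```python
import threading

STATIC_FALLBACK = ("503: Due to unexpectedly high volume, the site is "
                   "unavailable at this time. Please try again later.")

class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; shed the rest
    to a static page rather than letting the backend fail outright."""
    def __init__(self, max_in_flight):
        self._slots = threading.Semaphore(max_in_flight)

    def handle(self, backend):
        if not self._slots.acquire(blocking=False):
            return STATIC_FALLBACK   # shed: fast, cheap, and honest
        try:
            return backend()         # admitted: serve the real response
        finally:
            self._slots.release()    # free the slot for the next request

shedder = LoadShedder(max_in_flight=100)
print(shedder.handle(lambda: "200: page content"))
```

In practice the same gate usually lives at the load balancer or reverse proxy rather than in application code, but the principle is identical: a hard cap chosen from observed failure symptoms, with a graceful static response beyond it.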
Why don’t companies do this?
There are several reasons, but there are two that are more prominent than others.
First and foremost, these sorts of extreme load events are seen as extraordinarily unlikely to ever occur. Testing for scenarios that are less likely than being struck by lightning while simultaneously winning the lottery doesn’t make financial sense.
Second, these sorts of tests carry additional costs. Usage limits in software contracts, as well as the cost of running infrastructure at extremely high utilization, can make testing expensive, though not as expensive as application failure.
The unlikely nature of extreme load events for many applications, combined with the high cost of running these tests, means that many organizations neither stress systems to the breaking point to identify failure states nor put guardrails in place to remediate those conditions and ensure application survival.
What can we do?
Application delivery and performance engineering are complex activities, and there is no simple solution. Ideally, you can deduce operational parameters from normal testing and establish guardrails based on them. In many cases, I would advocate putting such protective barriers in place well before the system is in danger. As an analogy, consider a fence that keeps us away from the edge of a cliff. Ideally, it’s placed well back from the edge: even though there is potentially a lot of ground to walk on safely, the fence is there to keep us safe.
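The fence analogy translates directly into code: trip the guardrail at some fraction of the load where testing showed the system actually fails. A minimal sketch, in which the breaking-point figure and the 0.8 safety margin are both illustrative assumptions:

```python
def place_fence(breaking_point, safety_margin=0.8):
    """Put the fence well back from the cliff edge: trip at a fraction
    of the load where stress testing showed the system fails."""
    return int(breaking_point * safety_margin)

def admit(current_load, fence):
    """Admit new traffic only while we are still behind the fence."""
    return current_load < fence

# Suppose stress tests found the cliff at ~700 concurrent users.
fence = place_fence(breaking_point=700)
print(fence)  # 560: guardrail trips with real headroom left before failure
```

The margin itself is a business decision: a wider margin wastes more safe ground, a narrower one leaves less room for error once traffic starts surging.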
It’s important to note that putting guardrails in place can lead to higher operating expenses. Limiting your servers to 80% CPU usage, for example, means that you might require 12 or 13 servers instead of 10. At that point, it becomes a business decision to weigh the value of ensuring application reliability, where users can hit the refresh button and still access the application, against the cost of letting it fail. In this case, I do not believe that the Paizo team views their blog site as business critical. Once service is restored, they’re unlikely to experience such a surge of user traffic again.
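The headroom arithmetic behind that cost increase is simple enough to sketch. The traffic and per-server figures below are illustrative assumptions, not measurements from either outage:

```python
import math

def servers_needed(peak_rps, per_server_rps, max_utilization=1.0):
    """Servers required to absorb peak traffic at a given utilization cap."""
    return math.ceil(peak_rps / (per_server_rps * max_utilization))

# Assume a fleet sized for 1000 req/s on servers that saturate at 125 req/s.
at_full_tilt   = servers_needed(1000, 125)       # no headroom: 8 servers
with_guardrail = servers_needed(1000, 125, 0.8)  # 20% headroom: 10 servers
print(at_full_tilt, with_guardrail)  # 8 10
```

The extra machines are the visible, recurring cost of the guardrail, which is why the decision ultimately belongs to the business rather than to the engineering team alone.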
This is a compelling story for performance engineering teams to relate to their management. No matter how much you test, an application without safety precautions can almost always be driven to the point of failure. Closer communication with your colleagues in IT operations, helping them understand application stress factors so they can limit traffic once those conditions begin to manifest, can lead to applications that keep working for at least part of your user community.