Load Testing Retail Sites: How to Survive the Holiday Surge

It’s the stuff of eCommerce retail nightmares. You “flip the switch” on a big, heavily advertised promotion, and your website promptly crashes under the load, costing you customers, revenue, and brand reputation. Though it’s more likely that a website will simply slow down during heavy traffic, that alone is enough to cause damage. A recent Google study states that “53% of mobile site visits result in a user leaving a page that takes longer than three seconds to load.”

That metric is astounding. There are few day-to-day scenarios where 53% of users would deem three seconds “too long.” That is, however, the reality that retailers and websites face every day. And during the holidays, a particularly heavy season of online shopping, the concern can become more pronounced than ever.

Rather than accepting the threat as inevitable, however, retailers can take active steps to protect their websites and brands against a slow-down or outage due to heavy traffic. Tim Koopmans of Flood IO weighs in:

What can retailers do to avoid the slow-down during heavy traffic?

Production traffic is notoriously difficult to model and simulate. Estimates are often wrong, and retailers can end up in a situation where customers experience slow or degraded performance or, worse, downtime or outages. Retailers should plan for a number of fallback scenarios in case of unexpected or excessive load:

  1. Scalable infrastructure and application code that can be provisioned in near real time and can respond (automatically or via manual triggers) to surges in demand. Prior to provisioning, everything needs to be thoroughly load tested, including scaling policies, lifecycle hooks, automatic code deployment, health checks, and target tracking.
  2. Cross-region failover for scenarios that involve regional overload or outages. Load testing should include DNS failover and dynamic load balancing in its scope.
  3. Varying levels of caching. Most CDN providers will let you increase the level of caching in front of your origin servers, beyond just simple static assets, and can provide sophisticated rules for page-level caching. Load testing needs to ensure that changes to caching levels do not break application functionality, particularly for sites that rely on authenticated or transactional requests from your customers to the backend.
  4. For some retail applications, e.g. ticketing, a third-party queuing system is ideal for insulating origin servers from excessive demand. It offloads and queues customers on a customer-friendly holding page, and you then control the throughput back to your origin servers.
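The queuing idea in the last item can be sketched in a few lines. This is a minimal illustration only, not how any particular third-party queuing product works: the `WaitingRoom` class, its `admit_per_tick` parameter, and the visitor IDs are all hypothetical names chosen for the example. It holds visitors in arrival order and releases a fixed number to the origin on each tick.

```python
from collections import deque
from typing import Deque, List


class WaitingRoom:
    """Hypothetical virtual waiting room: holds visitors in FIFO order and
    releases at most `admit_per_tick` of them to the origin on each tick."""

    def __init__(self, admit_per_tick: int) -> None:
        if admit_per_tick < 1:
            raise ValueError("admit_per_tick must be at least 1")
        self.admit_per_tick = admit_per_tick
        self._queue: Deque[str] = deque()

    def join(self, visitor_id: str) -> int:
        """Add a visitor and return their 1-based position in the queue."""
        self._queue.append(visitor_id)
        return len(self._queue)

    def tick(self) -> List[str]:
        """Release the next batch of visitors, oldest first."""
        batch_size = min(self.admit_per_tick, len(self._queue))
        return [self._queue.popleft() for _ in range(batch_size)]

    def waiting(self) -> int:
        """Number of visitors still on the holding page."""
        return len(self._queue)


if __name__ == "__main__":
    room = WaitingRoom(admit_per_tick=2)
    for visitor in ["a", "b", "c", "d", "e"]:
        room.join(visitor)
    while room.waiting():
        print("admitted:", room.tick())
```

A load test against a setup like this would check two things: that the holding page itself stays fast under the full surge, and that the origin never sees more than the configured throughput.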

If there is a temporary outage, what is the best way for a retailer to handle the situation?

If customers do experience an outage, it is important for retailers to use load testing to confirm that the following outcomes can be achieved:

  1. Customers can still access your status page and incident management tool. This should be hosted separately from your production infrastructure (in another region) and must always be 100% available. There are a number of third-party managed services that provide this type of status page.
  2. Customers need to be informed throughout the duration of the outage. Social media channels like Twitter and Facebook are a great way to communicate in these circumstances. Ideally your incident management tool has the option for customers to subscribe to updates (or opt out).
  3. Resuming services after a high-volume outage can be difficult because it is easy to overwhelm origin servers when services are restored. The ability to rate limit or control load is useful in these scenarios, as is the ability to prioritize customer segments or regions for restoration.
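One way to picture the rate limiting described in the last item is a token bucket whose rate is raised gradually as services come back. The sketch below is an assumption-laden illustration rather than a recommended implementation: `restore_rate`, `TokenBucket`, the 10% starting fraction, and the linear ramp are all choices made for the example, and the clock is passed in explicitly so the behaviour is easy to test.

```python
def restore_rate(elapsed_sec: float, full_rate: float, ramp_sec: float,
                 start_fraction: float = 0.1) -> float:
    """Allowed requests per second `elapsed_sec` after restoration begins.

    Starts at `start_fraction` of `full_rate` and rises linearly to
    `full_rate` over `ramp_sec` seconds (an assumed ramp shape).
    """
    if ramp_sec <= 0 or elapsed_sec >= ramp_sec:
        return full_rate
    progress = max(0.0, elapsed_sec) / ramp_sec
    return full_rate * (start_fraction + (1.0 - start_fraction) * progress)


class TokenBucket:
    """Token-bucket rate limiter driven by caller-supplied timestamps."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.burst = burst
        self._tokens = burst  # start with a full bucket
        self._last = 0.0      # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self._last)
        self._last = max(self._last, now)
        # Refill for the time that has passed, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Five minutes into a ten-minute ramp towards 100 requests per second.
    bucket = TokenBucket(rate_per_sec=restore_rate(300, 100, 600), burst=5)
    print(bucket.rate_per_sec, bucket.allow(now=0.0))
```

Prioritizing customer segments or regions could be layered on top by giving each segment its own bucket and ramp, which is left out here to keep the sketch short.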

What else should retailers be doing NOW to prepare for the traffic?

Load testing is the best way to mitigate production risk to your application. It should not only focus on planned or estimated production workload models, but also cover the game day scenarios listed earlier. Ideally, your load test effort should also include high availability and dynamic scaling, DDoS, cross-region failover, and surge or spike scenarios. Some of these are extremely difficult to test in isolation or behind the corporate firewall (e.g. on controlled networks), which is why platforms such as Flood IO are extremely popular for simulating production load across real infrastructure in different scenarios.
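To make a surge or spike scenario concrete, it can help to write the load shape down before configuring any tool. The following is a tool-agnostic sketch, not a Flood IO configuration: the `target_users` function and all of its parameters are hypothetical, and the trapezoid shape (baseline, linear ramp up, hold at peak, linear ramp down) is just one reasonable assumption about what a promotion launch looks like.

```python
def target_users(t_sec: float, baseline: int, peak: int,
                 spike_start: float, ramp_sec: float, hold_sec: float) -> int:
    """Concurrent virtual users to run at time `t_sec` of the test.

    Shape: baseline -> linear ramp up -> hold at peak -> linear ramp down
    -> baseline. All times are in seconds from the start of the test.
    """
    if ramp_sec <= 0:
        raise ValueError("ramp_sec must be positive")
    ramp_up_end = spike_start + ramp_sec
    hold_end = ramp_up_end + hold_sec
    ramp_down_end = hold_end + ramp_sec

    if t_sec < spike_start or t_sec >= ramp_down_end:
        return baseline
    if t_sec < ramp_up_end:
        fraction = (t_sec - spike_start) / ramp_sec
    elif t_sec < hold_end:
        fraction = 1.0
    else:
        fraction = (ramp_down_end - t_sec) / ramp_sec
    return round(baseline + (peak - baseline) * fraction)


if __name__ == "__main__":
    # Sample the profile every 15 seconds for a 100 -> 1000 user spike.
    for t in range(0, 255, 15):
        print(t, target_users(t, baseline=100, peak=1000,
                              spike_start=60, ramp_sec=30, hold_sec=120))
```

Shortening `ramp_sec` turns the same profile into a harsher "flip the switch" test, while lengthening `hold_sec` shows whether scaling and caching hold up once the peak is sustained.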