Amazon Web Services Outage: Who is Responsible for Your Organization’s IT Resilience?
The recent AWS outage is a strong reminder of the risks that come with overdependence on a single cloud service.
- Amazon Web Services (AWS) recently experienced a widespread outage that impacted thousands of customer sites and services.
- As organizations move IT workloads to the cloud, they need to consider the level of required IT resilience as part of the transition.
- Not all applications and services require the same level of resilience.
- Interdependence and overdependence are drivers of IT fragility. Anti-fragility can be designed into most systems, even those built on public cloud services.
The recent extensive outage at Amazon Web Services (AWS) highlights the inherent challenge of public cloud services. Along with all of their benefits (and there are many), there are also the risks and realities of lost control, which need to be considered and mitigated. This single outage, centered in Virginia, impacted thousands of customer sites and services, including Adobe, Roku, Twilio, and Autodesk.
It is a fact that cloud services are, as some critics would say, just the use of other people’s computers. But while some IT operations can be outsourced, ultimate responsibility cannot. When cloud services operate as designed, there are huge benefits to go around. But when they don’t, their interdependent, fragile nature comes into painful focus. IT architects at AWS customers (and at customers of other cloud services) are often literally designing in single points of failure when they put all of their IT service eggs into a single computational basket.
Ironically, and perfectly highlighting this interdependent fragility, AWS’ own Service Health Dashboard, where AWS updates the public about service issues, was down during this outage because it depends on an underlying service that was knocked out. Given the market presence and size of AWS, estimated at 33% of the public cloud infrastructure market, the outage was described by some, somewhat hyperbolically, as a takedown of the web. That is not exactly true (the web itself was just fine), but given the accelerating move of IT workloads to the public cloud, in particular to the triumvirate of AWS, Google, and Microsoft Azure, this inherent interdependency conundrum can’t be ignored. Per the 451 Research data shown in Figure 1 below, 52% of workloads, up from 26% in 2020, will be hosted in public cloud environments by 2022. Thus, this 'eggs in a single basket' problem is likely to get worse in the very near future.
Figure 1: 451 Research – IT Workload Distribution Trends
How do we wring the benefits from the public cloud while further mitigating the resilience risks? Clearly, using a single public cloud service does not guarantee 100% service uptime. But resilience can be engineered into almost any system; the real questions are “how” and “how much will it cost.” Where to start? You first need to honestly assess the resilience requirements of your organization and its applications. Not all organizations and applications are created equal, and thus you shouldn’t treat their resilience requirements the same. What are the uptime requirements for your order entry website versus your internal HR management system? Both are important systems, but they are likely not equally important to your organization from the point of view of uptime. Customers whose orders fail may never come back, but your employees probably will.
Remember the “9s.” How many “9s” do you need and are willing to pay for? A good place to start is the high availability chart on Wikipedia, where you can see the amount of annual downtime you will experience from one “9” (90% uptime = 36.53 days of downtime in a year) to nine “9s” (99.9999999% uptime = 31.56 milliseconds of downtime in a year). I recognize that the expected “9s” of a particular public cloud service are hard to discern (probably somewhere between these two extremes!), but using this chart as a starting point is a good way to ground and verify your own resilience requirements.
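The arithmetic behind that chart is easy to reproduce. The following is a minimal Python sketch, not a formula published by any cloud provider. The function names are illustrative, and it assumes a 365.25-day year, which is the convention that yields the 36.53-day and 31.56-millisecond figures quoted above.

```python
# Assumption: a 365.25-day year, which matches the figures quoted in the text.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def annual_downtime_seconds(nines: int) -> float:
    """Downtime allowed per year for an uptime target with `nines` nines."""
    unavailability = 10.0 ** -nines  # 1 nine -> 0.1, 2 nines -> 0.01, ...
    return SECONDS_PER_YEAR * unavailability


def uptime_label(nines: int) -> str:
    """Build the uptime percentage as text, e.g. 3 -> '99.9%'."""
    if nines == 1:
        return "90%"
    decimals = "9" * (nines - 2)
    return "99" + ("." + decimals if decimals else "") + "%"


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600),
                       ("minutes", 60), ("seconds", 1)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds * 1000:.2f} milliseconds"


if __name__ == "__main__":
    for n in range(1, 10):
        downtime = humanize(annual_downtime_seconds(n))
        print(f"{uptime_label(n):>12}  ->  {downtime} of downtime per year")
```

Running the script prints one row per level, from one “9” to nine “9s,” which makes it easy to see how quickly the allowed downtime shrinks as you add each “9.”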
Having these uptime requirements in place gives your IT architects the beginnings of a set of requirements that they can act on. Maybe they can design greater resilience within the selected public cloud service, or maybe a separate cloud service will need to be set up to cover for downtime or failure. Or maybe some services shouldn’t be moved to the public cloud in the first place. Or maybe some hybrid public/private cloud setup is best. The answer is that it depends. But simply hoping for sufficient IT resilience is a recipe for bad surprises.
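To compare these options on paper, architects often start from the standard availability formulas for chained and redundant components. The sketch below is a simplified illustration with hypothetical function names and made-up availability figures. It assumes failures are statistically independent, which is exactly the assumption that hidden interdependencies tend to break, so treat the results as an optimistic upper bound rather than a prediction.

```python
import math


def serial_availability(parts):
    """Availability of a chain in which every component must be up."""
    return math.prod(parts)


def redundant_availability(parts):
    """Availability when any one of several independent copies is enough."""
    return 1.0 - math.prod(1.0 - a for a in parts)


if __name__ == "__main__":
    # Illustrative figures only, not measured values for any real service.
    chained = serial_availability([0.99, 0.99])
    failover = redundant_availability([0.99, 0.99])
    print(f"Two 99% services in a chain:       {chained:.4%}")
    print(f"Two 99% services as failover pair: {failover:.4%}")
```

Under these assumptions, chaining two 99% services lowers overall availability to about 98.01%, while running them as an independent failover pair raises it to about 99.99%. Any shared dependency between the two, or a failover mechanism that can itself fail, erodes that gain.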
A popular service for improving application uptime is email continuity. What would hours of email downtime in a given outage do to your organization’s productivity? How many “9s” does your organization need for its email? A continuity service adds redundancy and resilience to your primary email management system, whether it is Microsoft 365, Exchange on-premises, or something else. This way, in the case of planned or unplanned downtime of M365 or on-premises Exchange, your organization can continue to send, receive, archive and secure its email without interruption. Ultimately, only your organization can decide what level of IT resilience a given service or application needs. But clearly a one-size-fits-all approach is not the best way forward.