Azure/Office 365 Outages: The IT Admin's Guide to Continuity
A major outage in the US takes down a key Microsoft datacenter and a host of cloud services in the process. What to do when the “cloud” goes down?
Every vendor offering a cloud-based solution pours ungodly amounts of money into redundancy to ensure a single failure or even multiple failures go unnoticed by customers connected to their services. For months it appears as if nothing can go wrong. And then…it does.
Recent Microsoft Azure and Office 365 Outage
This week, Microsoft experienced Azure and Office 365™ outages after severe weather (lightning) took out cooling systems in data centers located in San Antonio, Texas, forcing servers and services to shut down. The outage was focused on the South-Central U.S., but it affected customers around the globe. More specifically, it hit Exchange Online, SharePoint Online, Teams and a variety of other solutions. Azure Active Directory (AD), which Office 365 depends on for identity management, was a problem as well.
After most services were restored, customers were receiving error messages for Outlook and Skype saying they were being throttled due to a change to Azure AD for Office 365 authentication.
Without belaboring the situation, the real question is: “What did we learn from this outage?”
Cloud “haters” will tell you to avoid the cloud. That’s ridiculous at this stage of the game. When an airline has an incident, do we stay out of the air? No, we learn from the failure. When it comes to cloud-based solutions, it’s important to understand that there is no perfect world where services never go down. Azure and Office 365 have gone down and will continue to go down. Microsoft will learn and improve, and we appreciate their efforts. But what does coping with that reality look like when an outage actually hits?
How to Prepare for Cloud Outages
You may have a recovery plan for your on-prem environment, but what happens when you experience a cloud outage? Do you have a plan to recover?
Upon first hearing news of an outage, you’ll likely check your Internet connectivity, verify that DNS is working, confirm that both in-office and mobile users are affected, and determine how many services are down. And then what?
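Those first triage checks are easy to script ahead of time, so nobody is typing commands from memory while the phones ring. What follows is a minimal sketch in Python, not a definitive tool. The hostname `outlook.office365.com`, the probe address `1.1.1.1`, and every function name here are illustrative assumptions; swap in whatever endpoints matter in your environment. Keep in mind that a successful TCP connection only shows an endpoint is reachable, not that sign-in or mail flow is healthy.

```python
import socket
from dataclasses import dataclass


@dataclass
class TriageResult:
    internet_ok: bool  # could we reach a well-known public IP at all?
    dns_ok: bool       # does the service hostname resolve?
    service_ok: bool   # does the service endpoint accept a TCP connection?


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def diagnose(result: TriageResult) -> str:
    """Map the three checks to a rough guess about where the fault lies."""
    if not result.internet_ok:
        return "local: no Internet connectivity"
    if not result.dns_ok:
        return "local or ISP: DNS resolution is failing"
    if not result.service_ok:
        return "provider: service endpoint unreachable, possible cloud outage"
    return "reachable: check the service health page and user reports"


def run_triage(service_host: str = "outlook.office365.com",
               probe_ip: str = "1.1.1.1") -> TriageResult:
    """Run the checks in order. Both default targets are assumptions."""
    internet_ok = tcp_reachable(probe_ip, 443)
    dns_ok = dns_resolves(service_host)
    # Only try the service endpoint if its name resolved.
    service_ok = dns_ok and tcp_reachable(service_host, 443)
    return TriageResult(internet_ok, dns_ok, service_ok)
```

Calling `diagnose(run_triage())` from both an office machine and a device on a mobile network covers the in-office versus mobile comparison as well. If both report the provider-side result, the problem is almost certainly not yours to fix.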
I envision an IT admin and their team getting repeated calls from folks who tell them their email is down along with a host of other services. I see the IT folks responding with, “We know, but there is nothing we can do,” while rechecking Twitter, Reddit and DownDetector.com to see if the outage is being handled. They may also check their Office 365 health report in the Office 365 admin center, but that’s often disappointing: it may take hours before a visible outage is classified as such, so you’re better off getting your news off the “street.”
Is this the new “normal” that all in IT have to get used to? Well, I can’t speak for all Office 365 services, but with regard to email, you don’t have to just sit back and wait for services to be restored. For years I’ve been telling folks to consider Mimecast as their go-to solution for enhanced security, compliance and continuity.
What does this same outage look like for a Mimecast customer? When the services went down, the IT admin and their team would receive an alert (SMS or secondary email). With a single-click panic button they can switch tracks from Office 365’s Exchange Online over to the Mimecast cloud for sending and receiving email. Outlook users and mobile users with the Mimecast app would never even notice an outage. Email would continue to flow in and out as before. Beyond that, if Mimecast were being used for security and archive purposes, you would not see a drop in security or compliance by switching tracks. You would continue to be just as secure and compliant as you were a moment ago. When the outage is resolved, IT can switch tracks back to Office 365, email would be updated, and duplicates avoided. Again, all without end-users knowing there was a massive outage that took out services globally.
These latest incidents provide a clear reminder of the need for organizations to build their own redundancy rather than rely on a single vendor. All organizations, including Microsoft, need to consider the downstream effects of losing critical services, whether the cause is a technical failure, human error or Mother Nature.