Archive & Data Protection

    Latest Microsoft 365 Disruption Highlights Hidden Complexity of Cloud Continuity Planning

    Global outage affects Microsoft 365 service availability, hampering productivity and communications for organizations around the world.

    by Richard Botley

    Cloud-based Microsoft applications, including Teams, Office and Outlook, were struck by a worldwide disruption on Monday – putting customers’ business continuity and resilience plans to the test.

    According to DownDetector reports, issues appear to have started at around 5 pm ET, with services not returning to normal for at least seven hours. Microsoft stated the incident was resolved at midnight ET, although some users continued to report issues deep into the next day. Additionally, a second issue was confirmed on Tuesday, in which customers were unable to connect to the Exchange Online service on their mobile devices.

    [Figures: Microsoft 365 tweet (ms365-tweet.png); Down Detector outage graph (ms365-down-detector-graph.png)]

    Image Source: Down Detector 

    As is common in these incidents, Microsoft rolled back some recent changes to mitigate the issue. There is no evidence of a malicious third party or attack – just an aggressive schedule of patches and upgrades in a highly complex global infrastructure.

    This week’s issues follow a number of incidents earlier this year. Teams had a significant outage in March, Microsoft 365 in May and Google’s G Suite in August. It’s widely understood that human error and technical failure will continue to happen and will affect every service at one point or another. So why are so many organizations still unprepared when it happens?

    Hidden complexity

    The answer may lie in a failure to comprehend or quantify the risks of hidden complexity.

    Organizations are moving to Microsoft 365 in droves to help simplify their IT estate, reduce costs and boost productivity. As part of this, Exchange Online has fast become the de facto standard for cloud business email, accelerated by the Covid-19 digital transformation boom.

    The problem here is that IT complexity has in fact increased yet is now hidden from sight. A vast web of interconnected systems and people could ruin your Monday productivity without you knowing that it’s coming.

    Let’s take email, and specifically continuity, as a prime example. Consider an enterprise IT leader choosing to use Exchange Online rather than maintaining their own Exchange server in a physical or virtual environment. From the IT team’s point of view, they are outsourcing patch management, reducing their data center footprint and moving to a scalable pay-as-you-go cloud model.

    The challenge is that they have also become part of the collective, hitching their fate (and uptime) to Microsoft. Thousands of engineers around the world could now affect your day with a single mistake. A critical piece of hardware running Azure Active Directory or Azure DNS could be the reason you get locked out of your data.

    As an individual organization, you were able to choose when to roll out patches based on your own business calendar, time zone and risk appetite. You may have opted not to make a major email server update the day before your Monday morning customer conference. Likewise, in a pinch, an admin could troubleshoot most directory and access issues as a local superuser. But now, as a Microsoft 365 subscriber, your individual needs are irrelevant. You simply comply with whatever IT changes Microsoft is making to ‘your’ services.

    Uptime responsibility

    Mimecast’s State of Email Security research found that 60% of global organizations using Office 365 had experienced an outage in the 12 months prior. 11% were severely impacted and 25% moderately. Only 9% of those stated the outage had not impacted their organization.

    This is why any organization moving to Microsoft 365 (or another cloud service) needs to carefully consider the business impact when that service becomes unavailable. Calculate what would happen to productivity for an email outage lasting an hour, a day – a week. What security and compliance risks could arise as employees start using unsanctioned or consumer IT services to get the job done?
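    As a rough starting point, that calculation can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a figure from any research: the head count, the hourly cost, the share of work that depends on email, and the choice of 8 and 40 working hours for a day and a week. The function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    label: str
    hours: float  # working hours lost while email is unavailable


def productivity_loss(employees: int, hourly_cost: float,
                      email_dependency: float, hours: float) -> float:
    """Estimate the cost of lost productivity for a single outage.

    email_dependency is the assumed fraction (0 to 1) of working time
    that is blocked when email is down.
    """
    if not 0.0 <= email_dependency <= 1.0:
        raise ValueError("email_dependency must be between 0 and 1")
    return employees * hourly_cost * email_dependency * hours


# Assumed scenarios: an hour, an 8-hour working day, a 40-hour working week.
SCENARIOS = [
    OutageScenario("one hour", 1),
    OutageScenario("one working day", 8),
    OutageScenario("one working week", 40),
]


def estimate_all(employees: int, hourly_cost: float,
                 email_dependency: float) -> dict:
    """Return the estimated loss for each scenario, keyed by label."""
    return {
        s.label: productivity_loss(employees, hourly_cost,
                                   email_dependency, s.hours)
        for s in SCENARIOS
    }


if __name__ == "__main__":
    # Hypothetical organization: 500 staff, $40/hour, 25% email dependency.
    for label, cost in estimate_all(500, 40.0, 0.25).items():
        print(f"{label}: ${cost:,.0f}")
```

    A model this simple ignores lost revenue, missed deadlines and the security cost of staff turning to unsanctioned tools, so treat the result as a floor rather than a full business impact assessment.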

    This is not a new idea; it is simply part of traditional disaster recovery planning. The issue is that many organizations are lurching into the cloud-first world and forgetting they always need a plan B. Every organization has unique security, continuity, availability, backup and data assurance risks that need consideration and mitigation.

    Moving to Microsoft 365 is certainly the right choice for many organizations. Microsoft can consolidate so much for you, but the responsibility for appropriate mitigation is ultimately yours. Every IT leader needs to remember they are still on the hook for their uptime, and for this reason I urge you not to compromise on email. Blaming Microsoft will not get you very far.

    Email continuity planning

    Email continuity (and Teams/Slack continuity) is a critical priority for organizations today. These are now primary forms of business communication and collaboration – a reliance amplified by the pandemic-led shift to remote work. These services are also major repositories of corporate data, and employee productivity can be severely impacted when they go down.

    The traditional approach to email continuity required expensive replication and mirrored services – often designed for on-premises environments. Backup is also the wrong tool for the job. Backups and archives are integral to data protection, recovery and compliance, but they usually lack the Recovery Time Objective (RTO) needed to stand in as an email continuity service.

    Today’s benchmark for email continuity must provide real-time access to email during primary service outages. IT administrators should be able to choose between manual and automatic activation but also empower individual users to switch to the backup service. Secure access to email and calendar information should be made possible at any time on any device from any location. Finally, once the incident is over, the continuity service needs to automatically synchronize all emails sent or received during the outage.
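    The behaviour described above can be modelled as a small state machine. The sketch below is purely illustrative: the class and method names are hypothetical, it does not represent any vendor's product or API, and real services would handle authentication, storage and calendar data that are left out here. It shows manual versus automatic activation, a per-user switch, and synchronization of outage-period messages once the primary service returns.

```python
from dataclasses import dataclass, field
from enum import Enum


class Activation(Enum):
    MANUAL = "manual"        # an administrator turns continuity on
    AUTOMATIC = "automatic"  # continuity follows primary health checks


@dataclass
class ContinuityService:
    activation: Activation
    active: bool = False
    user_overrides: set = field(default_factory=set)
    outage_messages: list = field(default_factory=list)

    def on_health_check(self, primary_up: bool) -> None:
        """In automatic mode, activate whenever the primary is down."""
        if self.activation is Activation.AUTOMATIC:
            self.active = not primary_up

    def admin_activate(self) -> None:
        """Manual activation by an administrator, valid in either mode."""
        self.active = True

    def user_switch(self, user: str) -> None:
        """Let an individual user move to the continuity service."""
        self.user_overrides.add(user)

    def uses_continuity(self, user: str) -> bool:
        return self.active or user in self.user_overrides

    def deliver(self, user: str, message: str) -> str:
        """Route a message and remember it if it went via continuity."""
        if self.uses_continuity(user):
            self.outage_messages.append((user, message))
            return "continuity"
        return "primary"

    def restore(self, primary_mailbox: list) -> int:
        """End the incident: copy outage messages back, then reset state."""
        synced = len(self.outage_messages)
        primary_mailbox.extend(self.outage_messages)
        self.outage_messages.clear()
        self.active = False
        self.user_overrides.clear()
        return synced
```

    A short usage example: with `svc = ContinuityService(Activation.AUTOMATIC)`, calling `svc.on_health_check(primary_up=False)` routes subsequent `deliver` calls to the continuity path, and `svc.restore(mailbox)` copies those messages into `mailbox` when the outage ends.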

    Resilience is not futile

    This latest incident is another reminder that relying on a single cloud service isn’t the most effective strategy. Microsoft simply can’t be relied on for availability, and the responsibility to mitigate that risk lies with every organization using its service.

    Whether unexpected or planned, downtime for email does not have to mean downtime for your organization. Don't let complexity become your, or even Microsoft’s, excuse. It's not that complex to make sure you've got a resilience strategy that lets you recover and continue with business as usual despite an outage.
