Reported authentication issue results in multiple service outages for cloud giant, causing global productivity disruption.
Were you sweating in your pajamas on Monday as you realized that not only was your Google Nest thermostat down – nearly every service Google provides was in fact down? You were not alone, as IT admins and users around the world were reminded of the danger of over-reliance on a single IT service provider.
During the incident, the Google Workspace Status Dashboard was showing Gmail, Google Calendar, Docs and a host of other services were having issues that prevented authentication of users. DownDetector showed 49,681 peak reports during the outage. A Google spokesperson later said they had experienced “an authentication system outage for approximately 45 minutes due to an internal storage quota issue.”
Even with a relatively short period of downtime, the resulting disruption can be significant – particularly when millions of users around the world are relying on these tools more than ever before. The Wall Street Journal reported how Wayne-Westland Community Schools in Westland, Michigan gave its roughly 9,800 students the day off after a morning of disruption. “This is the new snow day,” said the school spokeswoman to the reporter who also had to file his story via telephone while Google was down.
Thankfully, such large-scale failures in Google's systems are rare, but they are certainly not inconsequential. Google Workspace was introduced in October, replacing the G Suite brand, and touted as “everything you need to get anything done, now in one place.” Google Workspace now includes Gmail, Calendar, Drive, Docs, Sheets, Slides, Meet and Chat – all more tightly integrated than ever before.
Single Point of Failure
Unfortunately, this tight coupling and shared platform can also come with an increased risk of a cascading failure.
Authentication services should be a well-known potential point of failure to business continuity professionals, and it is not the first time this type of service failure has resulted in widespread disruption. Back in September, Microsoft was plagued by lengthy service issues linked to part of its authentication system, Azure Active Directory, that left a portion of its users locked out of multiple Microsoft cloud-based services. Additional Microsoft outages in October, November and December reiterated the growing productivity problem resulting from the hidden complexity of cloud continuity planning.
These outages can pose a significant challenge to end-user productivity, security and, in some cases, compliance. Using email as an example, some organizations have turned to Google as a cost-effective alternative to Microsoft Exchange Online, which is available with the Microsoft 365 service. Rather than maintaining their own Exchange server in a physical or virtual environment, they trust their cloud provider – in this case Google – to manage their email service.
The problem is that in the race to the cloud, more and more organizations, consumers – even governments – are doing away with decades of IT best practice and skimping on service redundancy. For decades, typical practice for critical business systems always included a plan B. Two telephone lines, two independent internet service providers, two data archives, a backup generator; the ‘two parachutes’ thinking to help preserve the life of critical business functions.
But these are challenging times of course. Digital transformation projects have been accelerated due to the COVID-19 pandemic and almost all organizations have had to rethink how they collaborate internally and with customers, partners – or even students, as in the above example. And thus, anecdotally, IT teams have been under considerable time and financial pressures to get new tools up and running, often with limited attention – at least upfront – to traditional disaster recovery and even security thinking.
Cloud service providers do build in some of their own internal redundancies – but often concentrate these efforts on data integrity with a Recovery Point Objective (RPO) of zero (i.e., no data loss in the event of any downtime). But there are major gaps when it comes to maintaining the availability of a service that no single provider has yet solved for. While there are great economic advantages of service homogeneity, this introduces the risk of widespread downtime if systems or shared services degrade or fail.
This is where the other critical measure, the Recovery Time Objective (RTO), comes into play. Each organization needs to calculate this for each cloud service they use. The RTO is the time and service level within which a business process must be restored after a disaster in order to avoid unacceptable effects from a break in availability.
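To make the RTO discussion concrete, it helps to translate an availability SLA into the downtime it actually permits. The short sketch below (an illustrative calculation, not any provider's published formula) converts an availability percentage into a maximum downtime budget per month:

```python
def max_downtime(availability_pct: float, period_hours: float = 730.0) -> float:
    """Return the maximum permitted downtime in minutes for a given
    availability percentage over a period (default: a 730-hour month)."""
    return period_hours * 60 * (1 - availability_pct / 100)

# A 99.9% ("three nines") monthly SLA still permits roughly 43.8 minutes
# of downtime -- comparable to the ~45-minute authentication outage above.
print(round(max_downtime(99.9), 1))  # -> 43.8
```

If your calculated RTO for a business process is shorter than the downtime a provider's SLA tolerates, the gap is yours to close with redundancy.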
Google itself offers an impressive treatise on disaster recovery architecture, yet it is unable to provide your organization with an always-available service. And it's exactly the same for Microsoft, AWS – even Mimecast. The latter provides an email continuity service designed to be deployed when your primary email service is down. You can’t build your own YouTube, but you could ensure you have a backup copy of your training videos hosted there. Likewise, you’re able to own the redundancy for your critical business functions, such as the ability to conduct virtual meetings, calls and send emails.
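The ‘two parachutes’ idea reduces, in code terms, to a simple failover decision: prefer the primary service, fall back to an independent secondary when the primary's health check fails. The sketch below is a minimal illustration of that logic – the endpoint names are hypothetical, and a real deployment would involve DNS or MX-record changes rather than a lookup table:

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    return None

# Primary is the main cloud provider; secondary is an independent
# continuity service. Names and statuses here are illustrative only.
endpoints = ["primary.mail.example", "continuity.mail.example"]
status = {"primary.mail.example": False, "continuity.mail.example": True}
print(pick_endpoint(endpoints, status.get))  # -> continuity.mail.example
```

The key design point is independence: a secondary that shares the primary's authentication system – as the Google and Azure AD incidents showed – fails with it.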
Only your organization can determine the required level of resilience for each business service or IT application you depend on. But every IT and risk management professional has a role to play in evaluating these risks and making appropriate plans upfront ahead of the next great downtime incident.
No organization is immune to failure and that’s why we use two parachutes – and where appropriate, two clouds.