Anatomy of an Outage – Insight and Learnings
It probably won’t surprise anyone that my capacity to communicate with customers and partners was limited during last Thursday’s problems. I had my sleeves rolled up, with my team, as we worked to restore service from our UK Woking Data Centre. I hope that Peter Bauer's blog posts, and the other communication channels we’ve been using, kept you informed during the incident and in its aftermath.
Firstly, it's important I tell you that our service is fully restored and back to normal operations. There will, no doubt, be a few isolated issues to fix here and there, and we’re continuing to address those now. From a service point of view – all our UK data centres are functioning normally and all processing clusters are operational in fully resilient mode. Remote device and application services (MSO and mobile) are up and running normally, as are archiving indexes and functions.
It's also important to know that long-term data was not damaged during this incident. A very small number of customer emails were bounced back to their senders for retry where delivery was not possible.
So – our services are fully recovered, and all of the data we keep for customers is where it should be.
Let me now take this opportunity to outline the key technical facts behind the incident, and the key learnings we’ll be taking forward from here.
We are working through the formal incident report now and we’ll be delivering it directly to customers on Tuesday 21st May.
The incident can be summarised as follows:
At 10:38 BST on Thursday morning, a high availability core switching solution within our Woking UK data centre failed. It's important to know that our platform software didn't cause this issue. Our network engineers tried to revive the switching solution, but eventually concluded that it could not be recovered and that the Woking DC was offline. At that point we invoked our full data centre failover process so that we could operate from another UK data centre.
We completed the failover process by around 14:00 on the same day, and customer emails began flowing some time before that. Three hours is a long time in terms of email backlog, and the backlog took time to work through, so some customers still experienced delayed emails into the evening.
Some customers were configured to deliver outbound emails only via our Woking data centre, so we had to work with these customers to ensure that their settings were correct before we could get their emails flowing again. Also, some residual DNS issues affected the delivery of email, with a small knock-on effect on a number of North American customers. These DNS issues were resolved over the course of the day.
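As a rough illustration of the configuration problem described above, the sketch below flags customers whose outbound routing depends on a single data centre. The hostnames, the customer names and the shape of the `outbound_hosts` mapping are all hypothetical; they are not our actual configuration format.

```python
from typing import Dict, List

# Hypothetical mapping of outbound relay hostnames to the data centre hosting them.
HOST_TO_DC: Dict[str, str] = {
    "outbound-woking.example.net": "woking",
    "outbound-dc2.example.net": "dc2",
}


def single_dc_customers(outbound_hosts: Dict[str, List[str]]) -> Dict[str, str]:
    """Return {customer: data_centre} for customers routed via only one data centre."""
    at_risk: Dict[str, str] = {}
    for customer, hosts in outbound_hosts.items():
        # Collect the distinct data centres behind this customer's relay hosts.
        data_centres = {HOST_TO_DC[h] for h in hosts if h in HOST_TO_DC}
        if len(data_centres) == 1:
            at_risk[customer] = next(iter(data_centres))
    return at_risk


# Illustrative data only.
customers = {
    "customer-a": ["outbound-woking.example.net"],
    "customer-b": ["outbound-woking.example.net", "outbound-dc2.example.net"],
}
at_risk = single_dc_customers(customers)
print(at_risk)
```

In this made-up data set only customer-a is flagged, because every one of its outbound hosts sits in the same data centre.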
Will It Happen Again?
Like any cloud vendor, we live in the knowledge that at some point, one of our data centres worldwide may fail. After all – that is why we have several of them in a mirrored configuration for every region. We would have expected to recover from such a failure very quickly by using the redundant capabilities we have built into the platform’s design. This case was different, as it was the first time that a data centre could not be recovered within a short period of time.
Moving forward, we continue to assume and plan for the eventuality of a volume data centre failing again for any number of reasons – because it will happen. It's what we do when that happens that matters. We have spent 10 years building software that is able to withstand this kind of catastrophic event – but we need to tweak our services and procedures so that such an event doesn't cause us or our customers significant pain.
We are scoping out significantly improved failover processes, which we think could reduce the downtime caused by a single DC failure by as much as 90%, and eliminate the knock-on effects entirely.
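To put that figure in perspective, here is the arithmetic as a minimal sketch. It assumes the roughly three-hour outage from this incident as the baseline and treats 90% as the upper end of the possible reduction; the function name is made up for the illustration.

```python
def remaining_downtime_minutes(baseline_minutes: float, reduction: float) -> float:
    """Downtime left after cutting the baseline by the given fraction (0.0 to 1.0)."""
    return baseline_minutes * (1.0 - reduction)


baseline = 3 * 60  # assumed baseline: a roughly three-hour outage, in minutes
remaining = remaining_downtime_minutes(baseline, 0.90)
print(round(remaining))  # prints 18
```

On those assumptions, a comparable single data centre failure would mean around 18 minutes of downtime rather than three hours.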
It’s important to note that the design of our service is good, and we will not need to rewrite significant parts of our code to deal with this kind of incident in future. Most of the systems did exactly what we designed them to do – which is why we had a 3-hour total outage as opposed to a total outage lasting 8 or even 12 hours, as we have seen with some other cloud service providers.
As painful as this event has been, it has undoubtedly made us stronger technically. We now know the realities of this kind of scenario first-hand. It's a hard way to learn, but it has also set in motion many adjustments that will mean that we can deal with disaster scenarios even more quickly. This gives me confidence in our future ability to deliver our service to customers no matter what.
Finally, I would like to add my apology to Peter’s and thank you all for your patience while we dealt with the incident.