A Tough Email Day
Today Mimecast suffered a high-availability (HA) network hardware failure that shut down services from one of our data centres in the UK. The outage lasted a little over three hours, from about 11am UK time to just after 2pm. Afterwards, some customers experienced slower service responses while the resulting backlog was being cleared.
I wanted to take this opportunity to say sorry personally and on behalf of Mimecast to our customers and partners affected by this issue today. I also wanted to give some background to the problem and our response.
First of all, I know how critically important email is to you and to your businesses. The importance and value of email and the challenges of running robust email infrastructures were among the main reasons Neil Murray, my co-founder and our CTO, and I started Mimecast just over 10 years ago. We sincerely appreciate the faith and trust that you place in us as your email gateway and your email continuity provider. We promise you that we will always be there to deliver your messages. We work day and night to keep that promise, investing extensively in our software development, data centres and infrastructure resilience. We have teams of people who work around the clock to support the service and to support you, our customers and partners.
For three hours today we did not live up to our availability promise. We are very sorry.
Over the last ten years we have avoided significant outages thanks to our infrastructure and to the constant scenario planning we conduct to mitigate any points of failure.
As a cloud vendor, we run our platform infrastructure in an active-active model, where communications are handled by all sides of our grid. If a component becomes unavailable, another part of the grid can take over. Failing over an entire data centre happens extremely rarely, and we deliberately do it manually because an automatic failover of this scale brings significant risks. The plans we had in place underestimated the time it would take to complete the task. We aim for under 30 minutes; this one took us over 2 hours.
We will be reviewing this procedure and making sure that we can do it faster - much faster - should we be called upon to do it again.
In terms of next steps, we will of course honour our SLA obligations, and we’ll be in touch proactively with all affected customers on this issue in the coming days. We appreciate the patience that many of our customers have shown during this tough day, and we will be working extra hard to ensure it never happens again.
Previous update and email routing information can be found here.