Dressed to the Nines: What Your Uptime SLA Really Means

Google and Microsoft have recently been poking holes in each other's uptime SLAs (Service Level Agreements). The squabble has been summed up by Paul Thurrott of Windows IT Pro.

In short, Google claimed its Google Apps service had achieved 99.984% uptime in 2010 and, citing an independent report, went on to say this was 46 times more available than Microsoft's Exchange Server. Microsoft retaliated by saying BPOS achieved 99.9% (or better) uptime in 2010 and that this was in line with its SLA. Microsoft quite rightly protested at Google's definitions of uptime and at what should or should not be included.

The discussion continues.

Uptime is one of those things included in your service provider's SLA that you never really give much attention to, unless it's alarmingly low: 90%, for example. Most Cloud, SaaS or hosted providers will give uptime SLA figures of between 99.9% (three nines) and 99.999% (five nines). Mimecast proudly offers a 100% uptime SLA.

All of these nines represent different levels of 'guaranteed' service availability. For example, one nine (90%) allows for 36.5 days of downtime per year. As I said, alarming. Two nines (99%) would give you 3.65 days of downtime per year, three nines (99.9%) 8.76 hours, four nines (99.99%) 52.56 minutes and five nines (99.999%) 5.26 minutes per year. Lastly, six nines, which is largely academic, gives a mere 31.5 seconds.
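The arithmetic behind those figures is easy to check for any SLA percentage. Here is a minimal Python sketch, assuming a 365-day year (which is what the numbers above use); the function names are mine and purely illustrative.

```python
# Downtime permitted per year by an uptime percentage.
# Assumption: a 365-day (non-leap) year, matching the figures quoted above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Return the seconds of downtime per year that an uptime SLA permits."""
    return SECONDS_PER_YEAR * (100 - uptime_percent) / 100


def describe(seconds: float) -> str:
    """Format a duration using the largest sensible unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for sla in (90, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{sla}% uptime allows {describe(allowed_downtime_seconds(sla))} of downtime per year")
```

Plugging in the 99.984% figure Google quoted gives roughly 84 minutes of downtime over a year, which puts it between three and four nines.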

What does all of this mean to you as a consumer of these services? In terms of actual service, very little, unless you happen to be in the minority percentage; that is to say, everything has gone dark and quiet and you're suffering a service outage. What is much more important is how the vendor treats you in the event they don't achieve 100%. It is hard for any vendor to absolutely guarantee 100% uptime all of the time, so you must make sure there is a provision for service credits or financial compensation in the event of an outage. If not, the SLA is worthless. Any reputable SaaS or Cloud vendor will have absolute confidence in their infrastructure, so based on historical performance a 100% availability SLA will be justifiable. Mimecast offers 100% precisely for this reason. We have spent a large amount of R&D time on getting the infrastructure right so it can be used to back up our SLA, and as a result we win many customers from vendors whose SLAs have flattered to deceive.


Perhaps a larger issue we ought to consider is highlighted by the arrows Google is flinging in Microsoft's direction: namely, how do vendors really define uptime? What sort of event do they class as an outage? Does the event have to last a minimum length of time to qualify? Is planned downtime included in the calculation? And so on.
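To see how much those definitions matter, here is a rough Python sketch that scores one and the same incident log under two different sets of rules. Everything in it is hypothetical: the incident durations are invented for illustration, and the `count_planned` and `min_minutes` options are my own stand-ins for the kinds of exclusions a vendor might write into an SLA, not any particular vendor's actual terms.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day measurement period


@dataclass(frozen=True)
class Incident:
    minutes: float         # how long the service was unavailable
    planned: bool = False  # True for scheduled maintenance


def uptime_percent(incidents, *, count_planned=True, min_minutes=0.0,
                   period_minutes=MINUTES_PER_YEAR):
    """Uptime over the period, counting only incidents the rules recognise."""
    downtime = sum(
        i.minutes
        for i in incidents
        if (count_planned or not i.planned) and i.minutes >= min_minutes
    )
    return 100 * (1 - downtime / period_minutes)


# Hypothetical year: one maintenance window and four unplanned outages.
log = [
    Incident(240, planned=True),
    Incident(8),
    Incident(4),
    Incident(45),
    Incident(3),
]

# Strict reading: every minute of unavailability counts.
strict = uptime_percent(log)
# Lenient reading: ignore planned work and anything shorter than 10 minutes.
lenient = uptime_percent(log, count_planned=False, min_minutes=10)

print(f"strict:  {strict:.3f}%")
print(f"lenient: {lenient:.3f}%")
```

The customer experienced exactly the same year in both cases, yet the strict reading lands at three nines and the lenient one at four.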

There is no standard by which uptime is defined, and common sense isn't always applied either. In other markets, consumers are reasonably protected from spurious vendor claims by independent third parties like Consumer Reports or Which?. Not so with the claims tech companies make regarding the effectiveness of their solutions, and the result is a great deal of spin, which in turn inevitably leads to misinterpretation and confusion.

Fortunately, we're not the only ones to see the need for standards here. Although it's still early days, you can get an overview of current efforts at cloud-standards.org.

Google and Microsoft's argument is based largely on differences in measurement rather than any meaningful difference in level of service. In a highly competitive market, any small differentiation can be perceived (by the vendor) as a bonus, but if we're all using different tape measures to mark our lines, the only reliable way to tell who comes out on top is to talk to the long-term customers.