April 19, 2017
What’s Your Contingency Plan for an Exchange Online Outage?

This winter I was sitting in a hotel room in Boston, watching a blizzard fall outside my window. Not the worst situation I could be in (as I sipped my cup of coffee). At some point the news flashed a number at the bottom of my screen for folks to call if the power went out. If the power goes out? Now that would have changed my comfy scenario pretty quickly, and I jotted that number down just in case I needed a contingency plan. Does the hotel have a generator? Where are the emergency exits (the elevators would be out)? A moment ago I was enjoying the snowfall, but now I’m thinking ahead because…well…things happen.
Things happen. And it’s smart to have a contingency plan in place to ensure you aren’t just a standby victim waiting for the lights to come back on (or, in the case of Exchange Online, for your email to come back up). In the years that Office 365’s Exchange Online has been available, there have been major and minor outages of the service each year, often at inopportune times (as if there is an opportune time to lose email). I think of the Microsoft Worldwide Partner Conference (WPC) in Orlando in 2015, where the email service was down for several hours. Or the December 2015 event that hit Europe, caused by a Microsoft engineer’s misconfiguration of Azure (which affected the Office 365 customers that rely on Azure for identity management and the like). Or the June 30, 2016 outage that affected some North American customers for up to nine hours! On the last day of the sales quarter!
Some say, “well, that’s the risk of going to the cloud, and when things go down, they go down… and you wait!” That may be true of some things. But what if I told you there was an alternative when it comes to Exchange Online? What if I told you that when it goes down (and it DOES go down), your users could continue to work and never even know there was an outage? A pretty nifty trick, especially if you’re the one who proposed the move to Exchange Online and don’t want to have to explain the outage (or your inability to do anything other than fold your arms and wait for Microsoft to fix it).
The solution? Mimecast’s Continuity for Exchange/Exchange Online
The way this works is brilliant. When you bolt Mimecast onto the front end of your Exchange or Exchange Online, you point your MX records at Mimecast and then set up send/receive (aka outbound/inbound) connectors so mail flows between the two. That allows Mimecast to perform enterprise-grade security scrubbing, along with an optional archive storing email coming and going. In addition, Mimecast has its own MTA, so in the event a problem occurs on the email server itself (Exchange or Exchange Online), the admin simply kicks off a continuity event in the Mimecast administration portal, and mail flow is then handled completely on the Mimecast side (with a 100% SLA). End users can continue to send and receive email in one of three ways: through Outlook if they have the Mimecast plugin for Outlook, through the Mimecast mobile app, or through a Mimecast web portal.
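To make the routing concrete: once the cutover is done, every MX record for the domain should resolve to the gateway rather than directly to Exchange. Here is a minimal sketch of that sanity check in Python; the hostname suffix is an illustrative assumption, not documented Mimecast infrastructure:

```python
def routed_through_gateway(mx_hosts, gateway_suffix=".mimecast.com"):
    # True only when every MX host for the domain points at the
    # gateway (the suffix here is an illustrative assumption)
    # rather than directly at Exchange / Exchange Online.
    return all(h.rstrip(".").lower().endswith(gateway_suffix)
               for h in mx_hosts)
```

Feed it the MX hostnames returned by your DNS lookup tool of choice; a `False` means some mail is bypassing the gateway and would not be covered during a continuity event.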
One of the biggest challenges facing IT admins these days with regard to availability of the Office 365 suite of services is transparency. It’s often the case that end users start to complain about a loss of service, but the IT admin doesn’t see an alert in the Office 365 admin center. Everything is showing up green, but the end users’ faces are all red. The IT admin turns to Twitter or Reddit or other social media outlets to try to determine whether the problem is on the company’s side or Microsoft’s. In Microsoft’s defense, there is quite a bit happening on their end, and depending on the extent and type of outage, one customer (or a group of customers) being down isn’t necessarily reason to throw out a red flag just yet. But the customers who are down need more transparency. And monitoring this type of outage is becoming increasingly difficult as Microsoft breaks users into separate pods, ultimately obscuring the true extent of an outage.
To address the need for better transparency, Mimecast is upping its game in the continuity space by adding Continuity Event Management (CEM). One of the key elements of CEM is the ability to monitor your connection to Exchange Online continuously, looking for possible problems. It does this using an ‘organic’ inbound check (can my SMTP server receive mail?) and a ‘synthetic’ outbound check (can my SMTP server send mail?). In the event of a problem, an alert is sent via SMS or to an alternate email address (logical, since your primary may be down), and you’re given what amounts to a panic button to manage the alert. Push the button, invoke Mimecast continuity mode for your people, and go back to whatever you were doing before the alert, knowing your people are fine.
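The inbound half of that monitoring loop can be approximated in a few lines. This is a rough sketch of the idea, not Mimecast’s actual CEM implementation: the probe function, the consecutive-failure threshold, and any defaults here are assumptions of mine.

```python
import smtplib

def smtp_reachable(host, port=25, timeout=10):
    # A crude 'organic'-style inbound probe: open an SMTP session
    # against the mail server and confirm it answers a NOOP.
    # (A stand-in for CEM's checks, not Mimecast's implementation.)
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as conn:
            code, _ = conn.noop()
            return code == 250
    except (smtplib.SMTPException, OSError):
        return False

def should_alert(probe_history, threshold=3):
    # Only page the admin after `threshold` consecutive failed
    # probes, so a single transient blip does not trigger a
    # continuity event. The threshold of 3 is an assumption.
    recent = probe_history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

Run the probe on a schedule, append each result to a history list, and hand `should_alert` the decision of when to fire the SMS.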
Truth be told, things happen. You know it. Cloud infrastructure breaks down sometimes. If you’ve been impacted by a cloud disruption, you’re not alone. And if you haven’t (yet), you’re not immune. So what’s your contingency plan? What do you do when Exchange Online goes out, in whole or in part? If your answer is ‘fold your arms and wait for Microsoft to fix it,’ that’s a choice you’re making. It’s not the only choice you have. You could choose to have a plan B: an email continuity solution that keeps your people sending and receiving email despite the outage.
You have a choice.
See how Mimecast can make email safer for your business. Schedule a demo today!
If there’s one thing we can be sure about, it’s that at some point in the future almost nobody will manage mailboxes on premises. The dominant players look likely to be Microsoft with Office 365 and Google with Google Apps, though of course others may emerge.
Not surprisingly, then, pretty much every CIO in the world has taken a look at these platforms and adopted a stance. The stance may involve proactive planning now with a rapid migration in mind, or it might be a case of keeping things as they are until the technology matures further. Or there might be any number of interim steps that will make a migration easier at some point in the future. I would wager that there is no CIO that hasn’t started thinking about migrating email, in its entirety, to the cloud.
For the last few years Mimecast has positioned itself as a companion technology to Microsoft Exchange, optimizing our cloud services to deliver maximum value to on premises or hosted Exchange customers. And now, of course, we’re also providing services for Office 365 customers, in both cloud-only and hybrid environments. Of our 9,000 or so customers, almost all of whom are on some form of Exchange, we are seeing a growing number using Mimecast and Office 365 together. With Office 365, we support very clear use cases that address specific customer needs that can’t be met by Office 365 on its own. It could be a particular compliance or eDiscovery need, or a desire for a ‘cloud-on-cloud’ High Availability solution to protect against downtime.
Office 365 may be the eventual destination for most businesses, but that doesn’t mean there is a crazy rush to migrate there, or indeed that it’s the only short to mid-term option. For example, we’re seeing the Managed Service Provider (MSP) market booming, as smaller businesses offload their Exchange infrastructures and move to hosted Exchange suppliers. At the other end of the scale, Exchange 2013 is an attractive option for companies who want to keep their mailboxes on-site. And we’re seeing a fair amount of hybrid deployment, with IT moving a subset of users to the cloud, and an independent archive like Mimecast’s giving them the flexibility to toggle mailboxes back and forth between on premises and the cloud as they see fit.
But let’s not kid ourselves. These are all interim measures, albeit interim measures that will be very profitable for those organizations operating in the space for some years to come.
The point, I guess, is that we’re all preparing for an Office 365 world. At Mimecast, we are building out and optimizing our Office 365-specific portfolio so the use cases are crystal clear. It’s not simply a question of offering alternative tools to those that Microsoft includes with its Office 365 SKUs, but showing how we offer additional layers of functionality that support specific customer needs. That way, over time, we actually see ourselves becoming an accelerator, or enabler for Office 365 adoption, since we effectively remove short-term barriers to adoption.
Naturally, Microsoft is working hard to add functionality of its own and make Office 365 as robust and feature rich as possible. Many of the ‘gaps’ that Michael Osterman calls out in his paper, Office 365 for the Enterprise: How to Strengthen Security, Compliance and Control, will be filled by Microsoft over the coming years. So does that mean third parties will find it hard to build businesses within this ecosystem? No. In fact, as the platform matures, more use cases will emerge just as happened with Exchange many years ago.
Microsoft will certainly want to make sure that the common elements of customer need are properly served by Office 365 off the shelf, but this is a company, unlike Google, that has always been committed to its partners, and to the creation of a vibrant community of ISVs around its core platforms. Office 365 will be no different, and there will be plenty of room for third parties who can help customers not only see over the short term hurdles, but enjoy a first class, zero compromise cloud experience in the longer term.
Microsoft has changed the way Offline Address Book (OAB) distribution works compared to previous versions of the product, removing a single point of failure in the Exchange 2007/2010 OAB generation design. While this new method of generating and distributing the Offline Address Book has its advantages, it also has a disadvantage that can result in a breach of privacy, especially in multi-tenant environments. In this article we will look at how OAB generation worked in the past versus how it works now, highlighting both the good and the bad.
Back in May 2009, I published an article entitled “How OAB Distribution Works” which has received a large number of visits and can be found on my personal blog. This article explains in detail the process behind OAB generation in Exchange 2007 and 2010, and I highly recommend it to anyone who is not familiar with OAB generation in previous releases of the product.
If you have not read the above article, let’s quickly summarise. In Exchange 2007/2010, every OAB has a mailbox server responsible for OAB generation. That mailbox server generates the OAB according to a schedule and places it on an SMB share under \\mailboxservername\ExchangeOAB. The Exchange 2007/2010 CAS servers responsible for distributing the Offline Address Book then download the OAB from this share to a folder advertised through Internet Information Services (IIS). Outlook clients discover the path of the IIS website through Autodiscover and download the files located under the OAB IIS folder over HTTP or HTTPS. If you need a more in-depth understanding of this process, again, I encourage you to read the blog post above.
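That last client-side step can be sketched as follows. The base URL and GUID below are placeholders I have invented; in practice the client learns the OAB URL from the Autodiscover response, and each OAB lives in a folder named after its GUID under the IIS /OAB virtual directory:

```python
from urllib.parse import urljoin

# Hypothetical base URL: in reality Outlook learns it from the
# Autodiscover response rather than having it hard-coded.
OAB_BASE = "https://mail.contoso.com/OAB/"

def oab_file_url(oab_guid, filename="oab.xml"):
    # Clients fetch the oab.xml manifest for the OAB's GUID folder
    # first, then the data files (full or differential updates)
    # that the manifest lists.
    return urljoin(OAB_BASE, f"{oab_guid}/{filename}")
```

This folder-per-OAB layout is also what made the NTFS-permission lockdown discussed later in the article possible.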
Now, the problem with the above design is that every OAB has one mailbox server hard-coded as the server responsible for performing OAB generation. The whole point of Exchange Database Availability Groups is to allow mailbox servers to fail and have databases fail over to other mailbox servers that are members of the same Database Availability Group. Hard-coding generation to one server presents a single point of failure: if the server responsible for generating the OAB were to fail, the OAB generation process would not fail over to another server. This means that until an administrator brings back the failed mailbox server, or moves the OAB generation process for that specific OAB to another mailbox server, the OAB in question will never get updated.
To fix this in the development of Exchange 2013, Microsoft needed a way to let any mailbox server fail without disrupting the OAB generation process; after all, this was the whole idea behind Database Availability Groups – the ability to allow mailbox servers to fail. Instead of spending development time on a failover technology built around OAB generation, Microsoft decided to incorporate the OAB generation process into Database Availability Groups. This means that instead of one mailbox server generating the OAB and sharing it out via SMB, the Exchange 2013 server hosting the active mailbox database containing the organisation mailbox is now the server responsible for generating the OAB. In fact, in Exchange 2013 the OAB is now stored in an organisation mailbox, so in the event a mailbox server fails or a database failover occurs, the OAB moves along with it. This architecture change has removed the OAB generation single point of failure that caused problems for organisations in previous releases of the product.
Whilst Microsoft removed the single point of failure from the OAB generation process, they introduced a problem with the distribution process. In previous releases there was a service running on CAS servers known as the Exchange File Distribution Service, which downloaded a copy of the OABs from the various mailbox servers performing the OAB generation task and placed them in a web folder available for clients to download. This allowed companies running multiple OABs to set NTFS permissions on the OAB folders to restrict who was allowed to download each OAB – especially useful in multi-tenant Exchange environments to ensure each tenant can only download the address book applicable to their organisation.
In Exchange 2013, the Exchange File Distribution Service has been removed from the Client Access Server, and the Exchange 2013 CAS now proxies any OAB download request to the Exchange 2013 mailbox server holding the active organisation mailbox containing the requested OAB. The CAS finds out which mailbox server this is by querying Active Manager. Because the Exchange 2013 CAS no longer stores each OAB in a folder under the IIS OAB directory, companies can no longer set NTFS permissions on those folders to restrict who can download each respective OAB. It is also important to note that organisation mailboxes provide no means for organisations to lock down who can download each OAB through access control lists. This introduces a privacy risk for companies who offer hosted Exchange services: someone who knew what they were doing and had a mailbox within the Exchange environment could download OABs belonging to other organisations and, as a result, gather a full list of employee contacts for data-mining purposes. Microsoft’s response to this threat, documented in the multi-tenant guidance for Exchange 2013, is for hosting companies to “monitor the OAB download traffic” – in other words, there is no real solution to prevent this from happening.
For more information about the Exchange 2013 OAB distribution process I strongly recommend the following article published by the Exchange Product Team.
Clint Boessen is a Microsoft Exchange MVP located in Perth, Western Australia. Boessen has over 10 years of experience designing, implementing and maintaining Microsoft Exchange Server for a wide range of customers including small- to medium-sized businesses, government, and also enterprise and carrier-grade environments. Boessen works for Avantgarde Technologies Pty Ltd, an IT consulting company specializing in Microsoft technologies. He also maintains a personal blog which can be found at clintboessen.blogspot.com.
In chapter one of “Microsoft Exchange 2013: Design, Deploy and Deliver an Enterprise Messaging Solution”, we talk about constraints that may be forced upon us when designing Exchange. One of these constraints may be that we must use either existing hardware or the incumbent virtualization solution. Existing hardware can be a bear of an issue, since if the sizing doesn't fit the hardware, then you don’t really have an Exchange deployment project anymore.
However, virtualization carries with it the promise of overcommitting memory, disk, and CPU resources – features used by most customers taking advantage of virtualization technologies. Note that overcommitting anything in your virtualization platform when deploying Exchange is not only a bad idea, it’s an outage waiting to happen.
Virtualization is not free when it comes to converting physical hardware to emulated virtual hardware. The figures vary between vendors, but you may be looking at a net loss in the range of 5-12 percent across the entire guest’s performance. Coming back to constraints, let us assume your customer – or your company – requires you to virtualize and use VMware as the chosen hypervisor.
Once you’ve established that you’re virtualizing, you then need to size your guest as if you were sizing its real-world physical equivalent. Let’s assume, for argument’s sake, that you end up requiring four cores per server, but you allocate eight, since more cores never hurt anyone, right?
You read the prevailing guidance carefully, so you decide to use an existing blade with eight cores and allocate another eight cores to Exchange, bearing in mind that you’ve already allocated eight cores to two other applications on the same blade. No fuss, you may think: the guidance states that you should allocate no more than two virtual cores per physical core. Since you’re a conscientious sysadmin, you’ve benchmarked CPU usage on the VMware host and decided that the values are acceptable.
Now it turns out that, for some reason, Exchange seems to run non-optimally. You decide to move Exchange to another blade with more CPUs and double the core count within the guest from eight to 16 vCPUs, since more CPUs never hurt anyone, right?
Turns out the expectation of a linear increase in performance is not fulfilled… so where do you turn next?
A good place to start would have been the vendor’s specific guidance pertaining to Exchange. In this case, VMware supplies the Exchange 2010 best practices guide, which states (emphasis added):
Consequently, VMware recommends the following practices:
- Only allocate multiple vCPUs to a virtual machine if the anticipated Exchange workload can truly take advantage of all the vCPUs.
- If the exact workload is not known, size the virtual machine with a smaller number of vCPUs initially and increase the number later if necessary.
- For performance-critical Exchange virtual machines (production systems), the total number of vCPUs assigned to all the virtual machines should be equal to or less than the total number of cores on the ESXi host machine.
While larger virtual machines are possible in vSphere, VMware recommends reducing the number of virtual CPUs if monitoring of the actual workload shows that the Exchange application is not benefitting from the increased virtual CPUs. For more background information, see the “ESXi CPU Considerations” section in the white paper Performance Best Practices for VMware vSphere 5 (http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf).
Before “consequently”, the guide briefly introduces VMware’s Virtual Symmetric Multi-Processing model and details a wait state known as “ready time”. Ready time is the metric that reveals why your Exchange workloads in VMware are not benefiting from more processors – assuming that ready time is consistently high (more than 5%).
The consequence of throwing more vCPUs at a guest than it requires is that the guest spends more time in ready time than necessary, as the hypervisor waits for ALL the underlying cores it believes are available to the guest to become available before executing instructions. In other words, the guest OS is ready to process instructions on the processor, but the hypervisor forces the guest to wait until all the physical cores are available. This state becomes much worse as the ratio of vCPUs to physical CPUs increases.
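To check whether ready time is your problem, convert the raw counter into the percentage that the 5% rule of thumb refers to. vCenter’s real-time charts report “CPU Ready” as milliseconds accrued over a 20-second sampling interval; here is a minimal sketch of the conversion (the sample numbers in the test are invented for illustration):

```python
def ready_time_pct(ready_ms, interval_s=20, num_vcpus=1):
    # Convert a vCenter "CPU Ready" summation (milliseconds accrued
    # over the sampling interval) into a per-vCPU percentage that
    # can be compared against the ~5% threshold mentioned above.
    return ready_ms / (interval_s * 1000.0) / num_vcpus * 100.0
```

A 1-vCPU guest that accrues 2,000 ms of ready time in a 20-second sample spent 10% of that interval waiting for a physical core – well past the point where adding vCPUs stops helping.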
In several chapters we make reference to the Windows Server Virtualization Validation Program (SVVP) and guide you to make sure that your chosen virtualization platform is listed and supported.
VMware is listed as supported for multiple versions of Exchange and Windows operating systems, yet your server performance is still bad. Does that mean it’s a bad hypervisor?
The point is that VMware is not a bad hypervisor; but failing to understand how VMware allocates CPU resources, and not following VMware’s guidance, will result in poor performance for your Exchange servers.
Had you followed the guidance, you would have started with fewer vCPUs (you needed four) instead of more and, following VMware’s advice to reduce the number of cores, you would have allocated one vCPU per physical CPU, leading to gratifyingly low ready time.
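That rule reduces to simple arithmetic. A quick sketch of the check implied by VMware’s third recommendation (the function and the numbers in the comment are mine, for illustration):

```python
def allocation_ok(vm_vcpus, host_cores):
    # VMware's rule for performance-critical Exchange VMs: the
    # total vCPUs of all VMs on the host should be equal to or
    # less than the host's physical core count.
    return sum(vm_vcpus) <= host_cores

# The overcommitted blade from the story above: three 8-vCPU
# guests on eight physical cores fails this check outright.
```

Running the numbers before deployment is much cheaper than chasing ready time afterwards.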
This is another stark reminder of the fact that we need to read relevant documentation as part of our planning process. All relevant documentation, not just that of the new software we are planning to use.
Nicolas Blank has more than 15 years of experience with various versions of Exchange, and is the founder of and Messaging Architect at NBConsult. A recipient of the MVP award for Exchange since 2007, Nicolas is a Microsoft Certified Master in Exchange and presents regularly at conferences in the U.S., Europe, and Africa.
Nicolas will be running a two day ‘Mimecast Exchange’ training event on the 31st of October and the 1st of November at Microsoft’s Cardinal Place in London. For your opportunity to win a place at the event, please read this blog post about the event.