

Exchange 2010 RTM High Availability Load Balancing Options

Exchange 2010 brings many advantages in the HA realm.  One of them is that Outlook clients now connect to the Client Access Server for their RPC endpoint.  This means that when a Mailbox Server does a *over (a failover or a switchover), the user stays connected to the same RPC endpoint.  You can also create a Client Access Array, which load balances your RPC endpoint across your CAS Servers.  Lots of information on the RPC Client Access Service is available here and here.  So what options are available for load balancing this new RPC Client Access Array while also load balancing all of our other services?  And what are the pros/cons of each method?  If you want to know, read on…
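Before we get to the load balancing options, here is a rough sketch of how the RPC Client Access Array itself gets created in the Exchange Management Shell (the array FQDN, site name, and database name below are just placeholders for illustration):

# Create the Client Access Array object for an Active Directory site (names are examples)
New-ClientAccessArray -Name "array.domain.com" -Fqdn "array.domain.com" -Site "PrimarySite"

# Point a mailbox database at the array so Outlook profiles use the array FQDN as the RPC endpoint
Set-MailboxDatabase -Identity "DB1" -RpcClientAccessServer "array.domain.com"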

Exchange Load Balancing Options

In Exchange 2007, if you wanted any type of HA, you needed at least four servers: 2 for the CCR nodes and 2 for the HUB/CAS nodes.  The reason you could not do it with only 2 servers is that CCR nodes were limited to the Mailbox role only, and an Exchange site always needs at least the HUB, CAS, and MBX roles to be operational.  In Exchange 2010, more options are available.  You now have something called Database Availability Groups (DAGs).  DAG members can hold all of the other Exchange roles (HUB/CAS/MBX/UM), but they still may not hold the Edge Transport role.

There is a problem though.  A Windows limitation prevents you from installing Windows Network Load Balancing on a server that also runs Failover Clustering, and DAG members rely on Failover Clustering. So while we can now get down to 2 Exchange 2010 Servers, we need another way to load balance the CAS role to provide High Availability for the following CAS services:

  • Outlook Web App (formerly Outlook Web Access) (HTTP Traffic)
  • Exchange Control Panel (HTTP Traffic)
  • Exchange Web Services (HTTP Traffic)
  • Exchange ActiveSync (HTTP Traffic)
  • Autodiscover (HTTP Traffic)
  • Offline Address Book (HTTP Traffic)
  • Outlook Anywhere (HTTP Traffic)
  • RPC Client Access (RPC  Traffic)

There are a few options for load balancing.  The first is to use ISA.  The problem here is that ISA can only load balance HTTP-based traffic.  If you take a look at the bulleted list above, you can see that the RPC Client Access Service is RPC traffic, which means that ISA cannot load balance it.  That leaves us with a few load balancing options:

  1. 2 Multi-Role DAG Members and Hardware Load Balancers – Utilize 2 Multi-Role DAG Members (MBX/HUB/CAS).  Use a hardware load balancer to load balance all of the bulleted items above, including the RPC Client Access Service via an RPC Client Access Array, which means load balancing port 135 for the RPC Endpoint Mapper plus the dynamic RPC port range (1024-65535).  Since the goal is High Availability, you would typically want 2 hardware load balancers so the load balancer itself is not a single point of failure.
  2. 2 DAG Members, 2 HUB/CAS Servers, and Windows Network Load Balancing – Utilize 2 DAG Members (MBX) and 2 HUB/CAS Servers with Windows Network Load Balancing.  Windows Network Load Balancing will load balance all of the bulleted items above, including the RPC Client Access Service via an RPC Client Access Array, which means load balancing port 135 for the RPC Endpoint Mapper plus the dynamic RPC port range (1024-65535).
  3. 2 DAG Members and DNS Round Robin – Use 2 Multi-Role DAG Members (MBX/HUB/CAS).  Use DNS Round Robin to achieve a “poor man’s” form of load balancing (see the dnscmd sketch after this list).  With this scenario, you will not have automatic failover for the RPC Client Access Service.  You essentially create two A Records for the RPC Client Access Array, one pointing to the first multi-role DAG Member and one pointing to the second.  You will most likely want to lower the TTL values of these DNS records to 5 minutes so that if a failure does happen, you can remove one of the A Records and clients will flush their DNS cache within 5 minutes.
  4. 2 DAG Members, ISA/TMG/UAG, and either Hardware Load Balancing or DNS Round Robin – Use 2 Multi-Role DAG Members (MBX/HUB/CAS).  Use ISA/TMG/UAG to load balance all of the HTTP items from the bulleted list above. The issue here is that with Exchange 2010, users connect to the Client Access Server for their RPC endpoint for mailbox access.  To make that redundant, we create an RPC Client Access Array, and that array can be load balanced through a hardware load balancer, DNS Round Robin, or Windows Network Load Balancing.  ISA/TMG/UAG cannot load balance non-HTTP traffic.  So if you have ISA/TMG/UAG, you can still use it to load balance all HTTP traffic, but you would still need a Hardware Load Balancer, DNS Round Robin, or Windows Network Load Balancing to load balance the RPC Client Access Array.  The example picture below shows the use of UAG mixed with a Hardware Load Balancer.
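For the DNS Round Robin approach in option 3, the two A Records with a 5-minute (300 second) TTL could be created with dnscmd against the DNS server hosting the zone.  The server name, zone, host name, and IP addresses below are placeholders:

# One A record per multi-role DAG member for the array FQDN, each with a 300 second TTL
dnscmd DNSServer01 /RecordAdd domain.com array 300 A 192.168.1.10
dnscmd DNSServer01 /RecordAdd domain.com array 300 A 192.168.1.11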

Exchange Load Balancing Options and their benefits

Taking a look at the list of options above, we can use several approaches: Windows Network Load Balancing, Hardware Load Balancing, DNS Round Robin, and ISA/TMG/UAG for the HTTP portion. Each has its pros and cons in terms of cost and functionality.

Hardware Load Balancing

Hardware Load Balancers have the most capacity in terms of user connections.  For SMBs, you won’t have to worry about load; capacity matters more for very large organizations.  In fact, Microsoft recommends that if you are going to require more than 7 HUB/CAS Servers in a load balanced farm, you use Hardware Load Balancers instead of Windows Network Load Balancing.  Hardware Load Balancers are also the most expensive option.

Hardware Load Balancers also have the best functionality from a Client to Server Affinity perspective, depending on the vendor used.  We can use multiple affinity methods and fall back to another method if our preferred affinity fails.  For example, we can set up our hardware load balancers to use the following affinity methods in order of preference:

  • Existing Browser-Based Cookie
  • Cookie created by the hardware load balancer
  • SSL Session ID
  • Source IP

The goal here is to make sure that every user is load balanced evenly and that automatic failover can occur quickly and smoothly.

Windows Network Load Balancers

Windows Network Load Balancing does not achieve as much capacity in terms of user connections as a Hardware Load Balancer, but it can still handle a lot of connections.  A Windows Network Load Balanced farm can contain as many as 8 CAS Servers without suffering performance degradation, and you would need a very large user base (tens of thousands of users) before needing more than that. Windows Network Load Balancing is built into Windows Server, so it’s a large cost savings compared to purchasing hardware load balancers.

Windows Network Load Balancing does not offer as much Client to Server Affinity functionality as Hardware Load Balancers.  It has only one affinity method: Source IP.  The downside of Source IP affinity shows up when you have a lot of connections coming from behind a single NAT’d source IP; all of those connections end up hitting the same Client Access Server, because Source IP is the only affinity method Windows Network Load Balancing has.
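As a rough sketch, assuming the NetworkLoadBalancingClusters PowerShell module on Windows Server 2008 R2 (the interface name, node name, cluster FQDN, and VIP below are all placeholders), building a WNLB array with Source IP affinity looks something like this:

Import-Module NetworkLoadBalancingClusters

# Create the NLB cluster on the first HUB/CAS server, then join the second one
New-NlbCluster -InterfaceName "LAN" -ClusterName "array.domain.com" -ClusterPrimaryIP 192.168.1.20 -OperationMode Multicast
Add-NlbClusterNode -NewNodeName "CAS02" -NewNodeInterface "LAN" -InterfaceName "LAN"

# Single affinity (Source IP) is the only client-to-server affinity WNLB offers
Add-NlbClusterPortRule -StartPort 443 -EndPort 443 -Protocol Tcp -Affinity Single

The HTTPS port rule is shown as one example; in practice you would add port rules for the RPC ports (135 and the dynamic range) as well.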

Most likely, if you don’t need more than 8 CAS Servers, Windows Network Load Balancing will suffice for your needs.  It’s cheap, comes with Windows, and does its job.

ISA Server, TMG, or UAG

ISA/TMG/UAG Servers do have more capabilities than Windows Network Load Balancing.  The one downside is that they cannot load balance RPC traffic.  Because of that, you can still use ISA/TMG/UAG to load balance your HTTP traffic, but you’ll still need a Hardware Load Balancer or Windows Network Load Balancing to load balance your RPC Client Access Array.

ISA/TMG/UAG does scale better than Windows Network Load Balancing, but not as well as a Hardware Load Balancer.  It does not cost as much as a Hardware Load Balancer, but it is more expensive than Windows Network Load Balancing.  ISA/TMG/UAG can also use cookies it creates itself as well as Source IP affinity, depending on the protocol being published.

Another upside to ISA/TMG/UAG is pre-authentication.  If the server a client has affinity to goes down, ISA/TMG/UAG still holds the user’s authentication context and automatically re-authenticates the connection against the new Client Access Server.

DNS Round Robin

DNS Round Robin scales just as high as Hardware Load Balancers because connections go directly to the Client Access Servers.  If anything, it has the highest scale, since there is nothing in the middle handling the connections.  It’s also free to use!  But in this case, free is not necessarily good, because you lose a lot of functionality.  Hardware Load Balancers, Windows Network Load Balancing, and ISA/TMG/UAG can all detect a server failure, automatically stop sending traffic to that server, and direct all traffic to a server that is operational.

DNS Round Robin has no automatic failure detection.  If a host goes down, an Administrator has to notice it, remove the DNS A/HOST record for the failed server, and then clients have to wait for the TTL on the old DNS record to expire before they begin connecting to the surviving server. So you save a lot of money with this option, but you lose all of the automation and take on downtime instead.
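When that failure does happen, pulling the dead record is a one-liner against the DNS server hosting the zone (server, zone, host name, and IP below are placeholders), after which clients converge once the 5-minute TTL expires:

# Remove the A record that points at the failed Client Access Server
dnscmd DNSServer01 /RecordDelete domain.com array A 192.168.1.10 /f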


Exchange 2010 RPC Client Access Service and Multiple Sites

A common question I see is whether the RPC Client Access Service (including Client Access Server Arrays) can access databases in other sites. The answer is yes. Let’s take a look at a couple of scenarios.

Scenario #1 – Full Site Failure

Let’s say you have a Client Access Server Array called array.domain.com and the Primary Site goes down.  As part of the manual site switchover process, you must update the DNS records from your Primary Site to point to the CAS infrastructure in your DR Site.  One of the several DNS records you change is the CAS Array: you change array.domain.com to point to DRSiteCAS instead of PrimarySiteCAS.  The client (after its cached DNS record expires; a 5-minute TTL is recommended for DNS records in site resilient solutions) will then start connecting to DRSiteCAS, which will access the database in the DR Site.

Scenario #2 – Server Failure(s) in Primary Site and Disabling Automatic Activation for Databases and Servers

In the case where all database copies in the Primary Site go down, your databases can automatically fail over to the DR Site, as long as you allow automatic activation on the DR servers (yes, you can turn off automatic activation on databases and servers) and as long as you still have majority for your quorum. In this scenario, the RPC Client Access Service (and array) can access the mailbox databases that are mounted in the DR Site.

Automatic Activation

As I just alluded to above, it is possible to turn off automatic activation on databases and servers (at the server level this setting is the Database Copy Auto Activation Policy).  Let’s say you want to prevent a specific database copy from being considered in the Automatic Activation Process.

You can use the following command to prevent the database from being considered in the Automatic Activation Process:

Suspend-MailboxDatabaseCopy -Identity DB1\MBX2 -ActivationOnly

This example resumes the copy of the database DB1 on the server MBX2 for automatic activation:

Resume-MailboxDatabaseCopy -Identity DB1\MBX2

This is also possible to do at the mailbox server level using the Set-MailboxServer cmdlet.  You can use the following command to prevent any databases on a specific mailbox server from being considered in the Automatic Activation Process:

Set-MailboxServer -Identity MailboxServer -DatabaseCopyAutoActivationPolicy Blocked

This example resumes all database copies on the mailbox server “MailboxServer” for automatic activation:

Set-MailboxServer -Identity MailboxServer -DatabaseCopyAutoActivationPolicy Unrestricted
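To verify where things stand afterward, a quick check of both levels (the copy name and server name are the same placeholders used above):

# Is activation suspended on this specific copy?
Get-MailboxDatabaseCopyStatus -Identity DB1\MBX2 | Format-List Name,Status,ActivationSuspended

# What is the server-wide auto activation policy?
Get-MailboxServer -Identity MailboxServer | Format-List Name,DatabaseCopyAutoActivationPolicy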

Example

Let’s say we have 6 DAG Servers, with 4 in the Primary Site and 2 in the DR Site, and no modifications to the Automatic Activation Policy (so the DAG Servers in the DR Site can automatically mount databases).  Let’s also say that a lack of funds for storage prevents us from having mailbox database copies on all servers.  So PrimarySiteMBX01 and PrimarySiteMBX02 in the Primary Site mirror each other’s mailbox database copies, and PrimarySiteMBX03 and PrimarySiteMBX04 mirror each other as well.  PrimarySiteMBX01 and PrimarySiteMBX02 are also mirrored to SecondarySiteMBX0102 in the DR Site, and PrimarySiteMBX03 and PrimarySiteMBX04 are mirrored to SecondarySiteMBX0304 in the DR Site.

To make it a bit clearer, the following image shows the database distribution.  You can see there are 6 nodes and 3 copies of each database.

Should PrimarySiteMBX01 and PrimarySiteMBX02 go down (as illustrated below), SecondarySiteMBX0102 can automatically mount the databases because we still have majority for quorum.  In this case, the RPC Client Access Array in the Primary Site will still be able to provide mailbox access to the databases mounted on SecondarySiteMBX0102 in the DR Site.  This is one of the things I like about Exchange 2010 High Availability: if your Primary Site copies go down, you can allow the copy in the DR Site to activate automatically (provided the Database Activation Policy described above allows it), whereas in Exchange 2007 you had to manually activate any SCR copy.

Exchange 2007 and Exchange 2010 clusters both use Majority Node Set clustering.  This means that a majority of your votes (server votes and/or 1 file share witness vote) need to be up and running.  With DAGs, if you have an odd number of DAG nodes in the same DAG (cluster), you already have an odd number of votes, so you don’t use a witness.  If you have an even number of DAG nodes, you get a file share witness, so that if half of your nodes go down, the witness acts as the extra +1 vote.

So in this scenario, we have 6 votes from the servers plus 1 vote from the file share witness, totaling 7 votes.  This means we can have up to 3 voters fail and our cluster will still be online: with 7 votes, losing 3 leaves us with 4, which is still more than half. Because of this, we still have majority, and our quorum and cluster remain fully operational.
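If you want to see the witness and vote-carrying members for yourself, something like the following should show them (the DAG name is a placeholder; -Status populates the live witness information):

Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name,Servers,WitnessServer,WitnessShareInUse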

Now when exactly would we have to do a manual switchover?  There are a couple of cases.  The first is when your Primary Datacenter has a complete outage, due to a power failure, an environmental disaster, etc.  The other is when all of the Primary Datacenter DAG members go down, or enough of them go down that quorum is lost (again, a majority of voters must be up, which means that if we lose 4 or more of our 7 voters, the file share witness included, the entire cluster goes offline).  In that case, you’ll have to do a manual datacenter switchover.  You’ll move all services over to the secondary datacenter, including changing the RPC Client Access Server FQDN to point to the single CAS Server, or to the standby VIP that publishes RPC across multiple Secondary Datacenter CAS Servers.
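At a very high level, the DAG portion of that manual datacenter switchover looks something like the sketch below (the DAG and site names are placeholders, and the CAS/DNS and transport steps are omitted):

# Mark the failed Primary Datacenter members as stopped (use -ConfigurationOnly when they are unreachable)
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite -ConfigurationOnly

# Shrink the DAG to the surviving DR members and activate the databases there
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite DRSite

After that, you would update DNS (including the RPC Client Access Array FQDN) as described above so clients follow the services to the secondary datacenter.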
