
Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 2

Welcome to Part 2 of Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering.  In Part 1, I discussed what Majority Node Set Clustering is and how it works with Exchange Site Resilience when you have one DAG member in a Primary Site and one DAG member in a Failover Site.  In this Part, I will show an example of how Majority Node Set Clustering works with Exchange Site Resilience when you have two DAG members in a Primary Site and one DAG member in a Failover Site.

Part 1

Part 2

Part 3

Real World Examples

In Part 1, I showed a Real World example when you have one Exchange DAG member in the Primary Site and one Exchange DAG member in the Failover Site.  In this Part, I am showing a Real World example when you have two Exchange DAG members in the Primary Site and one Exchange DAG member in the Failover Site.

3 Node DAG  (Two in Primary and One in Failover)

In the following screenshot, we have 3 Servers.  All three are Exchange 2010 Multi-Role Servers; two in the Primary Site and one in the Failover Site.  The Cluster Service is running on all three Exchange Multi-Role Servers.  More specifically, it runs on the Exchange 2010 Servers that have the Mailbox Server Role.  When a DAG has an even number of Nodes, Exchange 2010 utilizes Node Majority with File Share Witness.  Because we have an odd number of Nodes, we are utilizing Node Majority and will not utilize a File Share Witness.

So now we have our three Servers; all three of them being Exchange.  This means we have three voters and do not need a File Share Witness as we have a third node.  So the question is, how many voters/servers/cluster objects can I lose?  Well if you read the section on Majority Node Set (which you have to understand), you know the formula is (number of voters / 2) rounded down, + 1.  This means we have (3 Exchange Servers / 2) = 1.5, rounded down = 1, and 1 + 1 = 2.  This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational, just as if we were utilizing 2 DAG members with a File Share Witness (which is also three voters).
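The arithmetic above can be sketched in a few lines of Python.  This is only an illustration of the formula, not anything Exchange or the Cluster Service exposes; the function names are made up for this example.

```python
def votes_required(voters: int) -> int:
    """Minimum voters that must be online: (voters / 2) rounded down, plus 1."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerable_failures(voters: int) -> int:
    """How many voters can be lost before the cluster drops offline."""
    return voters - votes_required(voters)


# 3 voters covers both cases in this series:
# three DAG members, or two DAG members plus a File Share Witness.
for voters in (2, 3, 4, 5):
    print(f"{voters} voters: {votes_required(voters)} must be online, "
          f"{tolerable_failures(voters)} can fail")
```

For three voters this gives 2 required and 1 tolerable failure, which matches the numbers worked out above.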

But now let’s say one of your Exchange Servers goes offline.  Well, you still have at least two cluster objects online.  This means your cluster will still be operational.  If all users/services were utilizing the Primary Site, then everything continues to remain completely operational.  If you were sending SMTP to the Failover Site or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Servers in the Primary Site.

But what happens if you lose a second node, leaving only one Exchange Server in the Primary Site? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times, and now only one remains.  At this time, the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case you would be activating the Primary Site and specifying a new Alternate File Share Witness Server that exists in the Primary Site so you can activate the remaining Exchange 2010 Server in the Primary Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the recovery process is re-adding the failed Servers back to the DAG once they become operational.  Once you re-add the second DAG node, you have two DAG members in the DAG, which will want to switch the DAG Cluster into Node Majority with File Share Witness.  That is why you still need to specify a File Share Witness.

But what happens if you lose two nodes in the Primary Site? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times.  At this time, the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case you would be activating the Failover Site and specifying a new Alternate File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Failover Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
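To make the three failure cases above easier to compare, here is a small Python model of them.  It is a sketch only: the server names are hypothetical, and the function just applies the majority formula and reports which site still holds the surviving member, which is the site you would activate by hand.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical server names; the 2 + 1 site layout follows this article.
DAG_MEMBERS: Dict[str, str] = {
    "PRI-MBX1": "Primary",
    "PRI-MBX2": "Primary",
    "DR-MBX1": "Failover",
}


def assess(members: Dict[str, str], failed: Set[str]) -> Tuple[bool, Optional[str]]:
    """Return (quorum_held, site_to_activate_manually).

    A site is reported only when quorum is lost and all survivors sit in one site.
    """
    survivors = [name for name in members if name not in failed]
    needed = len(members) // 2 + 1
    if len(survivors) >= needed:
        return True, None
    sites = {members[name] for name in survivors}
    site = sites.pop() if len(sites) == 1 else None
    return False, site


print(assess(DAG_MEMBERS, {"DR-MBX1"}))               # one node lost
print(assess(DAG_MEMBERS, {"DR-MBX1", "PRI-MBX2"}))   # second node lost
print(assess(DAG_MEMBERS, {"PRI-MBX1", "PRI-MBX2"}))  # both Primary nodes lost
```

The model says nothing about the switchover steps themselves; it only shows when quorum is lost and where the remaining DAG member lives.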

Once the Datacenter Switchover has occurred, you will be in a state that looks as such.  An Alternate File Share Witness is not for redundancy for your 2010 FSW that was in your Primary Site.  It’s used only during a Datacenter Switchover which is a manual process.

Once your Primary Site becomes operational, you will re-add the first Primary Site DAG Server to the existing DAG, which will still be using the 2010 Alternate FSW Server in the Failover Site, and you will now be switched into a Node Majority with File Share Witness Cluster instead of just Node Majority.  Remember I said that with an odd number of DAG Servers you will be in Node Majority, and with an even number the Cluster will automatically switch itself to Node Majority with File Share Witness?  You will now be in a state that looks as such.

Part of the Failback Process would be to switch to a FSW Server in the Primary Site.  Once done, you will be back into your original operational state.

Now the final step of the Failback Process would be to re-add your final remaining DAG Member in the Primary Site.  Once done, your cluster will switch back into a Node Majority Cluster and will no longer be utilizing the FSW.
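The way the quorum model flips during Failback can be traced with a short Python sketch.  It assumes the simplified rule used in this article (odd member count means Node Majority, even means Node Majority with File Share Witness); the function names and step labels are mine, for illustration only.

```python
def quorum_model(dag_members: int) -> str:
    """Odd member count: Node Majority. Even: the File Share Witness gets a vote."""
    if dag_members % 2:
        return "Node Majority"
    return "Node Majority with File Share Witness"


def voters(dag_members: int) -> int:
    """The witness adds one voter only when the member count is even."""
    return dag_members + (0 if dag_members % 2 else 1)


failback_steps = [
    ("after switchover (Failover Site member only)", 1),
    ("first Primary Site member re-added", 2),
    ("second Primary Site member re-added", 3),
]

for label, members in failback_steps:
    total = voters(members)
    print(f"{label}: {quorum_model(members)}, "
          f"{total} voters, {total // 2 + 1} must be online")
```

The middle step is the reason a File Share Witness has to be specified even though the DAG does not use it at the start or the end of the process.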

As you can see with how this works, the question that may arise is where to put the majority of your Exchange DAG Members.  Well, it should be in the Primary Site with the most users or the site that has the most important users.  With that in mind, I bet another question arises: why the site with the most users or the most important users?  Because some environments may want to use the above with an Active/Active Model instead of an Active/Passive one.  Some databases may be activated in both sites.  But, with that, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other cluster object.  Again, you must have two cluster objects online.  This also means that each cluster object must be able to see one other cluster object.  Because of that, the Exchange 2010 Server in the Failover Site will go completely offline.
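A WAN outage can be modelled the same way: each site only sees its own voters, so only a site that holds a majority by itself keeps quorum.  The sketch below is a simplified model of that rule; the function name is hypothetical.

```python
from typing import Dict


def quorum_by_site(voters_per_site: Dict[str, int]) -> Dict[str, bool]:
    """With the WAN link down, a site keeps quorum only if it alone holds a majority."""
    total = sum(voters_per_site.values())
    needed = total // 2 + 1
    return {site: count >= needed for site, count in voters_per_site.items()}


print(quorum_by_site({"Primary": 2, "Failover": 1}))
# {'Primary': True, 'Failover': False}
```

With two voters in the Primary Site and one in the Failover Site, any databases that were active in the Failover Site go down with it, which is the problem with running this layout Active/Active.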

To survive this, you really must use 2 different DAGs.  One DAG where the majority of your Exchange 2010 DAG Members are in the First Site and a second DAG where the majority of the Exchange 2010 DAG Members are in the Second Site.  Users that live in the First Active Site would primarily be using the Exchange 2010 DAG Members in the First Active Site.  Users that live in the Second Active Site would primarily be using the Exchange 2010 DAG Members in the Second Active Site.  This way, if anything happens with the WAN link, users in the First Active Site would still be operational as the majority of the Exchange 2010 DAG Members for their DAG are in the First Active Site and DAG 1 would maintain Quorum.  Users in the Second Active Site would still be operational as the majority of the Exchange 2010 DAG Members for their DAG are in the Second Active Site and DAG 2 would maintain Quorum.

Note: This would require twice the number of servers since a DAG Member cannot be a part of more than one DAG.  Each visual representation below of a 2010 HUB/CAS/MBX is a separate server.
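As a rough check of the two-DAG idea, the Python sketch below applies the majority rule to each DAG separately during a WAN outage.  The 2 + 1 member counts per DAG are an assumption that mirrors the three-node layout used earlier in this article, and the DAG and site names are placeholders.

```python
from typing import Dict, Optional

# Assumed layout: each DAG keeps its majority in the site whose users it serves.
DAGS: Dict[str, Dict[str, int]] = {
    "DAG1": {"Site1": 2, "Site2": 1},
    "DAG2": {"Site1": 1, "Site2": 2},
}


def surviving_site(members_per_site: Dict[str, int]) -> Optional[str]:
    """Site where this DAG keeps quorum when the sites cannot see each other."""
    needed = sum(members_per_site.values()) // 2 + 1
    for site, count in members_per_site.items():
        if count >= needed:
            return site
    return None


for name, layout in DAGS.items():
    print(f"{name} stays online in {surviving_site(layout)}")

total_servers = sum(sum(layout.values()) for layout in DAGS.values())
print(f"separate servers needed: {total_servers}")
```

Under these assumptions DAG1 stays up in the first site and DAG2 stays up in the second, at the cost of six separate servers instead of three.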

The Multi-DAG Model would look like this.

 


19 Responses to “Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 2”

  1. on 12 Aug 2011 at 11:19 am, Zachary Loeber

    Well explained good sir!

  2. on 02 Sep 2011 at 10:39 am, Hons

    Well explained, but unable to implement!
    Server that already a member of DAG1, can not be a member of DAG2.

  3. on 02 Sep 2011 at 7:33 pm, Elan Shudnow

    Hence why I show twice the amount of servers.

  4. on 03 Sep 2011 at 6:56 am, Hons

    Got what you mean. Thanks a lot.

  5. on 10 Sep 2011 at 12:27 am, Cas in contact

    Nice example and clear explanation – thank you.

  6. on 13 Sep 2011 at 1:06 pm, Alastair

    Excellent article but can you provide more details about the datacenter switchover process when the primary site is down?

  7. on 10 Nov 2011 at 11:23 am, question

    Is it normal for the cluster to go offline for a second when switching between Node majority and FSW?

    We have the exact same setup with two mailbox servers and a FSW in one site and another mailbox server in a remote location.

    Every time the WAN link goes down the cluster service restarts and Outlook is disconnected for a split second.

  8. on 12 Nov 2011 at 12:22 am, Elan Shudnow

    As long as you maintain Quorum, the cluster.exe should never go down. And you don't switch between node majority and Node Majority with FSW when a node goes down. The quorum model changes when servers are removed from the cluster/DAG completely.

  9. on 09 Jan 2012 at 10:55 pm, DaveP

    What I don't see here is what happens with DAG1 when you loose active site 1 for a prolonged period of time. Is Quorum maintained with the single server remaining at active site 2 and mail service uninterupted . We would seek to use the secondary mx record to ensure mail flow and use ADSl and webmail to access the mailbox externally via webmail to provide a semblance of service until the primary site has recovered.

  10. on 20 Jan 2012 at 1:07 pm, Elan Shudnow

    Well, part of the failover process is to remove the primary site from the cluster so that you maintain quorum with just your failover DAG members. So to the DAG members in the failover site, they are the only servers in the DAG now and that's why you can maintain quorum. The only potential problem is these DR servers still think there is a copy maintained in the primary site. Because of this, if you have backups occurring in the DR site, logs will not be truncated since there is a copy maintained in the primary site. You could, however, break the copy to the servers that were in the primary site and then log truncation will occur. When the primary site comes back up, you will need to reseed the databases back to the primary servers.

  11. on 30 Jan 2012 at 6:14 am, JBB

    I have a 3 node DAG and have an active/passive setup. 2 MBX nodes are in the primary site and 1 MBX node is in the secondary site. All the users connect to the primary site.

    After reading through this part, I am confused if I need the FSW in the primary site or not. If the MBX goes down on the secondary site, the 2 MBX servers in the primary are still active as they still have majority. But if one of the MBX servers in the primary site also goes down, then Exchange cluster will fail in the primary site as there would not be any majority.

    So in order to get majority in the primary site, are you suggesting to have FSW in the primary site? How many servers would eventually be there once the server that was down in primary site has been brought online? Please keep in mind that I have active/passive configuration and all the users would be connecting to primary site at all times. Can you please clarify on this?

  12. on 31 Jan 2012 at 9:05 am, Elan Shudnow

    You won't be using the FSW since you'll have an odd number of nodes. You can see in the Visios (and I do mention it in the text) that there is no FSW due to there being an odd number of nodes. You would still specify a FSW though as you only temporarily use the FSW as you build the DAG when having the even number of nodes or during failback as you add the primary site's DAG members back into the DAG when restoring service to the Primary datacenter. Once you add that third node back, the cluster automatically switches to Node Majority instead of Node Majority with Witness when you only had 2 nodes.

  13. on 03 Mar 2012 at 11:58 am, Jsaz

    Hell Elan,
    Followup to the question and ur reply above, the clarification required here is what if my production setup has 2 mbx at primary site and 1 mbx at DR site. In this case i will have odd no and no need to configure FSW. however the catch here is incase if any one of the mbx servers in Primary site is down than my cluster will go offline..if i want to remove this problem than whats the recommendation ? adding one more mbx server at primary location or suggest.Because ideally with 2 mbx server in primary location i sud get high availability.

  14. on 13 Mar 2012 at 8:05 pm, Jeffy-g

    Élan, seriously…your site is great.. Between you and Jeff Shertz, you both have helped me understand so many Exchange 2010 and LYNC intracacies. Thank you so much for explaining things so thoroughly. You have been an excellent resource.

  15. on 18 Apr 2012 at 4:20 am, Jouhar

    I have two MBX server on PR and One MBX on DR, FSWs are configured on both sites, when PR site is down, DR cluster fails, what can be the reason?

  16. on 24 Aug 2012 at 1:13 am, Rajendra Sonawane

    Excellent.. FSW is required and work only when we have even no of Nodes.

  17. on 24 Aug 2012 at 1:16 am, Rajendra Sonawane

    If I am deleting one of the database from Active node from DAG, Will it delete copy of the same database from passive copies and logs. Please explain the process.. Thank you in advance

  18. on 24 Aug 2012 at 8:31 am, Elan Shudnow

    When you delete a database, it removes it from the active as well as removes the copies on any other server. You would still have to go to the file system on each server that had a copy and delete it from the filesystem.

  19. on 07 May 2013 at 2:13 am, natinkel.pl

    Like a parent, you should be aware the threat of
    online predators is real. Salford had a doing well music scene in front
    of Madchester happened.
